The lexical access component of the CMU continuous speech

recognition system.

Alexander I. Rudnicky, Zongge Li, and Lynn K. Baumeister

Department of Computer Science, Carnegie-Mellon University,
Pittsburgh, Pennsylvania 15213.


Abstract 

The CMU Lexical Access system hypothesizes words
from a phonetic lattice, supplemented by a coarse
labelling of the speech signal. Word hypotheses are
anchored on syllable nuclei and are generated independently
for different parts of the utterance. Junctures
between words are resolved separately, on demand from
the Parser module. The lexical representation is
generated by rule from baseforms, in a completely
automatic process. A description of the various components
of the system is provided, as well as performance
data.


This paper describes the lexical access system under
development at Carnegie-Mellon University. The design of the
hypothesizer is based on the following principles:

• Words can be generated bottom-up with a very high
degree of accuracy. Given a sufficiently accurate
transcription of the speech signal, it is possible to use a
completely bottom-up paradigm to drive word recognition,
without assistance from higher-level constraints,
such as those that might be provided by a narrowly
defined task or a restrictive grammar.

• Multiple knowledge sources are necessary for
generating high-quality word hypotheses. The information
contained in a phonetic transcription is of itself
insufficient to guarantee high accuracy; additional constraints
on interpretation, either derived from alternate
analyses of the signal or from stored knowledge about
speech characteristics, are necessary for accurate
hypothesizing.

The word hypothesizer produces lexical hypotheses using
the phonetic label lattices produced by the Acoustic-Phonetic
component of the system [1]. Figure 1 presents a schematic
diagram of the hypothesizer module. The principal functional
components of the word hypothesizer are the following:

• Matching Engine: The matcher generates a lattice of
word hypotheses. A modified beam-search algorithm is
used to match a phonetic transcription against a lexicon
stored in the form of a phonetic network.

• Anchor Generator: The matcher does not attempt to
match words at all possible positions in an utterance, as
might, for example, a two-level DP algorithm. Rather, the
anchor generator uses a coarse segmentation of the
speech wave to locate syllable nuclei and to define likely
word regions ("anchors").

• Coarse Labeller: It is capable of producing a robust
segmentation of the speech signal into silence, noise, and
vocalic regions. Coarse labels are used both to locate
syllable nuclei and as a secondary source of information
by the matcher.

In addition to the above components, the lexical access
system also includes a Phonetic Lattice Integrator and a
Juncture Verifier.

The Phonetic Lattice Integrator combines and transforms
the independently generated information contained in the
stop, closure, vowel, and fricative lattices produced by the
Acoustic-Phonetic labelling component. The actions performed
by the Phonetic Lattice Integrator include the adjustment
of boundaries, the resegmentation of overlapping segments,
and the combination of label probabilities from different
lattices.

The role of the Verifier is to process word-juncture verification
requests generated at the Parser level. The Verifier
analyses junctures between words and indicates to the Parser
whether the words in question can form a phonetically acceptable
sequence.


Figure 1: Word Hypothesizer system diagram


1  Matching  Engine 

Words are hypothesized by matching an input sequence of
labels against a stored representation of possible pronunciations,
the lexicon. The matching algorithm makes use of both
a phonetic lattice and a coarse lattice. The network search
algorithm used in the current system is based on the beam
search algorithm, but has been substantially modified to deal
with the particular demands of the current task.

Beam-search is a modified best-first search strategy that
extends paths with scores within some window of the global
best score. The width of this window (the "beam") controls
the severity of pruning applied to the search. The principal
difference between a conventional beam-search (as implemented,
e.g., in the Harpy system [3]) and the current algorithm
is the ability to simultaneously search paths of different
lengths. Although the search tree is expanded segment by segment
(i.e., is time-locked), paths may begin at a number of
separate locations in the anchor region (see below). Because
of the resulting differences in path lengths, the bounds of the
beam cannot be calculated in a simple fashion. The solution
used is to normalize all path scores by their duration.
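The normalization step can be illustrated with a minimal sketch. It assumes that scores are costs (lower is better) and that the beam is a fixed window; the names `Path`, `normalized`, and `prune` are hypothetical and are not taken from the system described here.

```python
from dataclasses import dataclass

@dataclass
class Path:
    word: str
    cost: float     # accumulated cost; lower is better (an assumption)
    duration: int   # total duration covered so far

def normalized(path: Path) -> float:
    """Cost per unit duration, so paths of different lengths are comparable."""
    return path.cost / path.duration

def prune(paths: list[Path], beam: float) -> list[Path]:
    """Keep paths whose normalized cost lies within `beam` of the best one."""
    best = min(normalized(p) for p in paths)
    return [p for p in paths if normalized(p) <= best + beam]

# Three hypothetical paths that started at different points in the anchor region.
paths = [Path("send", 30.0, 300), Path("end", 22.0, 200), Path("and", 60.0, 200)]
survivors = prune(paths, beam=0.05)
```

Without the division by duration, the longer path would be penalized simply for having accumulated more segments.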

The size of the search tree is controlled in two ways: by
modifying the width of the beam and by altering the score of a
given path through the use of penalties.

Beam width is calculated dynamically at each ply and is
based on a pre-set width modified by a value based on the
size of the expansion stack generated at the preceding ply.
The effect is to relax pruning when there are few nodes on the
stack and to tighten it when the stack begins to grow excessively
large. One practical consequence of this is to allow
paths that initially have poor scores to survive long enough to
accrete positive evidence. Another consequence is to permit
more severe pruning later in the search, when the number of
paths is typically the greatest. Dynamic beam adjustment
speeds the algorithm up by 39% and reduces the depth of the
output lattice by 18%, while maintaining match accuracy.
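One way such an adjustment might look is sketched below. The text does not give the actual update rule, so the functional form, the `target_size` and `gain` parameters, and the clamp are all assumptions chosen only to reproduce the qualitative behaviour (wider beam for small stacks, narrower beam for large ones).

```python
def dynamic_beam(base_width: float, stack_size: int,
                 target_size: int = 100, gain: float = 0.5) -> float:
    """Return the beam width for the next ply.

    The ratio exceeds 1 when the previous stack was smaller than the
    (hypothetical) target size and falls below 1 when it was larger.
    """
    ratio = min(target_size / max(stack_size, 1), 4.0)  # clamp the widening
    # Blend the pre-set width with the scaled width.
    return base_width * ((1.0 - gain) + gain * ratio)

narrow_stack = dynamic_beam(1.0, stack_size=50)    # relaxed pruning
large_stack = dynamic_beam(1.0, stack_size=400)    # tightened pruning
```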

In addition to pruning based on beam width, the system
uses several other strategies to control the size of the search
tree. Since search progresses uniformly through successive
segments, paths that pass through the same node in the network
at the same segment ("collisions") are compared, and
only the best path is kept (work with Harpy has shown that
although this is a sub-optimal strategy, it nevertheless, in
practice, produces near-optimal network traversals, at substantial
savings in search effort).
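Collision handling can be sketched as follows, assuming each path at the current segment is represented by a hypothetical `(node, cost, history)` tuple and that a lower cost is better.

```python
def resolve_collisions(paths):
    """Keep only the lowest-cost path per network node.

    `paths` holds (node_id, cost, history) tuples that all end at the
    same segment; the tuple layout is an illustrative assumption.
    """
    best = {}
    for node, cost, history in paths:
        if node not in best or cost < best[node][1]:
            best[node] = (node, cost, history)
    return list(best.values())

merged = resolve_collisions([(1, 0.4, "a"), (2, 0.3, "b"), (1, 0.2, "c")])
```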

Two additional pruning factors come into play through
their ability to modify the cost of a path and thereby place it
outside the search beam.

The first of these is a duration range associated with each
phonetic label in the lexicon. Paths that remain in a particular
state (phone) for either shorter or longer than the
characteristic range for that phone incur a penalty. For example,
the duration range for a /b/ is [3 30], based on the
observation that /b/ bursts typically do not exceed 20ms; the
constraint for an /s/ is [50 250], again based on the observation
that /s/ phones are typically at least 40ms in duration.
Similarly, the duration constraint provides a different range
for an /ax/ as opposed to a diphthong, such as /ay/. Exiting
a state either too early or too late incurs a penalty; this
penalty is added to the path score.
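A minimal sketch of the duration check follows. The two ranges are the ones quoted above, read here as milliseconds (an assumption); the linear form of the penalty and the parameter name `pd` are also assumptions.

```python
# Duration ranges quoted in the text, interpreted as milliseconds.
DURATION_RANGE_MS = {"b": (3, 30), "s": (50, 250)}

def duration_penalty(phone: str, duration_ms: float, pd: float = 1.0) -> float:
    """Penalty proportional to how far a state's duration falls outside its range."""
    lo, hi = DURATION_RANGE_MS[phone]
    if duration_ms < lo:
        discrepancy = lo - duration_ms      # exited the state too early
    elif duration_ms > hi:
        discrepancy = duration_ms - hi      # stayed in the state too long
    else:
        discrepancy = 0
    return pd * discrepancy
```

A path that stays in /b/ for 45 units would, under these assumptions, be charged for the 15 units by which it exceeds the range.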

A second type of penalty is assessed when the coarse class
of a phone mismatches that provided by the coarse labeller.
The assumption here is that if the two types of label do not
match, an error is likely. Again, the penalty added to the path
score makes it a candidate for pruning. If the match is already
poor, this penalty hastens its pruning. In fact, this
penalty is most useful for rapidly terminating paths that
wander across category boundaries, for example, remaining in
a vocalic state when the segments have become non-vocalic.
In the current implementation, enforcing cross-lattice consistency
reduces the size of the search by a factor of about 3. If
consistency were absolutely enforced (i.e., inconsistency
results in immediate pruning), search would be reduced by a
factor of 6-7, though with a loss in accuracy.

The calculation of the path score is performed according to
the following formula:

The formula consists of three terms: the phonetic score, the
duration penalties, and the lattice mismatch penalties; n is
the length of the path. The phonetic score consists of the following:
d_i is a segment duration, p_i is a label probability, and
k is a scaling factor (a computational convenience).

The duration penalty consists of a_i, the amount of discrepancy,
and PD, a system parameter controlling the degree
of penalty. The lattice penalty consists of a system
parameter, PL, scaled by the duration of the segment, d_i.
Normalization is necessary, as paths of different length need to be
comparable. The final term in the equation represents a state
shortfall. Each hypothesis in the lexicon is required to match
a minimum number of core phonetic states. Matching fewer
than this number implies that the word has been severely
reduced, a condition which is penalized in the current system.
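The following sketch shows one way the terms described above could be combined. It is not the paper's formula: the use of a negative log probability, the reading of n as the number of segments, and the shortfall parameter `ps` are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass

@dataclass
class Segment:
    duration: float     # d_i
    prob: float         # p_i, label probability
    discrepancy: float  # a_i, duration-range violation (0 if none)
    mismatch: bool      # coarse class disagrees with the coarse labeller

def path_score(segs, k=1.0, pd=1.0, pl=1.0, missing_states=0, ps=1.0):
    """Normalized path cost (lower is better) under the stated assumptions."""
    phonetic = sum(k * s.duration * -math.log(s.prob) for s in segs)
    duration = sum(pd * s.discrepancy for s in segs)
    lattice = sum(pl * s.duration for s in segs if s.mismatch)
    n = len(segs)  # path length, taken here as the number of segments
    # The shortfall term penalizes words that matched too few core states.
    return (phonetic + duration + lattice) / n + ps * missing_states

segs = [Segment(10, 1.0, 0, False), Segment(20, 1.0, 5, True)]
score = path_score(segs)
```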


2  The  Lexicon 

The lexicon is stored in the form of a phonetic network.
The process of creating a net is as follows. For the chosen
vocabulary, a set of base-form pronunciations is obtained.
The sources of pronunciations that have been used include
the following: lookup in an on-line phonetic dictionary,
such as the Shoup dictionary; the generation of pronunciations
using a letter-to-sound compiler (the MITalk system); or
direct construction. Each approach has its advantages and
disadvantages. We have found that automatic generation as a
first pass, followed by hand correction, generally produces the
most acceptable result and does so in a reasonable amount of
time. Baseforms are further expanded into pronunciation
networks in order to take into account different possible
realizations of a word, such as those due to rapid-speech
phenomena and coarticulatory effects. Possible variations in
pronunciation are expressed in the form of phonological rules
that are applied automatically (in an off-line procedure) to the
baseform pronunciation. Figure 2 shows a typical rule,
governing /iy/ desyllabification. The rule-applier scans the
pronunciation string for the pattern specified in the IF portion
of the rule, binding the elements of the pattern as
specified. Terms headed by a "+" match 0 or more elements,
which are bound to the following variable (e.g., LeftContext).
Terms headed by ">" must match a single element, typically
meeting the constraints specified in the remainder of the
clause; constraints are expressed in terms of phonetic features,
such as CONS (consonant) or velar (a place of
articulation). The THEN part of the rule has two clauses: the
first specifies the portion of the pronunciation string to be
emitted, the second clause the portion to be rescanned with
the pattern. Depending on what is put into each clause, a
rule may be made to apply once, multiple times, or iteratively
to a pronunciation. The current CMU lexicon is constructed
using a base of over 150 rules, covering several types of
phenomena, including coarticulatory phenomena and front-end
characteristics. A small number of additional rules perform
necessary bookkeeping functions.



Figure  2:  A  phonological  rule 

(IY-syl-loss-a
  (IF ((+ LeftContext)
       (> Tm1 (has CONS))
       (> Tar (has VOWEL HIGH FRONT) (lacks LAX))
       (> Tp1 (has VOWEL))
       (+ RightContext)
      )
   THEN
      ((LeftContext
        Tm1 (alt 'Y 'IY) Tp1)
       (RightContext))
  )
)


The  above  rule  applied  to  the  word  Columbia: 

K  AX  L  UH  M  B  IY  AX 
K  AX  L  UH  M  B  (Y  ,  IY)  AX 
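The effect of this rule on a baseform can be mimicked with a short sketch. It hard-codes the single pattern (consonant, IY, vowel) rather than interpreting the rule notation, and the `VOWELS` set is an illustrative subset, so it should be read as an approximation of the rule-applier's behaviour for this one case.

```python
VOWELS = {"AX", "UH", "IY", "IH", "AA", "EH"}  # illustrative subset only

def iy_desyllabify(phones):
    """Replace IY by the alternatives ('Y', 'IY') between a consonant and a vowel."""
    out = list(phones)
    for i in range(1, len(phones) - 1):
        prev_is_cons = phones[i - 1] not in VOWELS
        next_is_vowel = phones[i + 1] in VOWELS
        if phones[i] == "IY" and prev_is_cons and next_is_vowel:
            out[i] = ("Y", "IY")  # an alternation: either phone may be matched
    return out

columbia = iy_desyllabify("K AX L UH M B IY AX".split())
```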


Expansion is performed by adding nodes and arcs to the
base pronunciation through the application of phonological
rules. The individual nets produced in this fashion are then
merged together into a single network, the representation
used by the matcher. The merge collapses common initial
states to eliminate redundant matches and produces a network
that fans out from a few initial states into a larger number
of states, the penultimate states corresponding to individual
lexical entries.
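The collapsing of common initial states can be illustrated with a prefix tree. This is a simplification: the actual network carries alternations and rule-generated arcs, whereas the sketch below treats each pronunciation variant as a plain phone sequence, and the `"#words"` key is a hypothetical marker for lexical entries.

```python
def build_network(pronunciations):
    """Merge pronunciations into a nested dict that shares common prefixes.

    `pronunciations` maps each word to a list of phone sequences (variants).
    """
    root = {}
    for word, variants in pronunciations.items():
        for phones in variants:
            node = root
            for ph in phones:
                node = node.setdefault(ph, {})   # reuse an existing initial state
            node.setdefault("#words", []).append(word)
    return root

net = build_network({
    "send": [["S", "EH", "N", "D"]],
    "sent": [["S", "EH", "N", "T"]],
})
```

Both words share the states for S, EH, and N, so the matcher would evaluate that prefix only once.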


3  Anchor  Generation 

The structure of speech constrains the possible locations of
words in an utterance; that is, a word may not begin or end at
some arbitrary point: permissible end-points are governed by
the acoustic properties of the signal. To eliminate unnecessary
matches, the system uses syllable anchors to select locations
in an utterance where words are to be hypothesized.

The anchor selection algorithm is straightforward and is
based on the following reasoning. Words are composed of
syllables; syllables all contain a vocalic center (de-voiced syllables
can be treated as a special case). Word divisions cannot
occur inside a vocalic center, thus all syllable and word
breaks will occur in the regions between vocalic centers. The
coarse labeller provides information about vocalic, non-vocalic,
and silence regions, as well as information about
energy dips within vocalic regions (typically corresponding to
liquids, glides, and nasals). This allows the utterance to be
segmented into two kinds of regions, vocalic centers and boundary
regions. An anchor, as used by the matcher, consists of two
anchor regions, a beginning and an ending one, separated by
one or more vocalic centers, the number of centers determining
the number of syllables that words hypothesized for that
region should have. Figure 3 provides a schematic diagram of
the anchoring process.
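The reasoning above can be sketched as a small anchor enumerator. It assumes the coarse segmentation has already been reduced to a list of region labels, with "B" for a boundary region and "V" for a vocalic center; the label names, the `max_syllables` limit, and the tuple layout are illustrative assumptions.

```python
def make_anchors(regions, max_syllables=3):
    """Enumerate anchors as (begin_index, end_index, n_syllables) triples.

    An anchor pairs a beginning boundary region with a later ending
    boundary region, separated by one or more vocalic centers.
    """
    boundaries = [i for i, r in enumerate(regions) if r == "B"]
    anchors = []
    for a, begin in enumerate(boundaries):
        for end in boundaries[a + 1:]:
            n_syll = regions[begin:end].count("V")
            if 1 <= n_syll <= max_syllables:
                anchors.append((begin, end, n_syll))
    return anchors

anchors = make_anchors(["B", "V", "B", "V", "B"])
```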

The matching algorithm allows words to begin anywhere in
the beginning region (i.e., the initial state of the network is
put on the stack for each phonetic segment in this region).
Paths may not transition into the network's final state until
the path extends into the ending region. The algorithm is
implemented in such a fashion that, for a given word in the
lexicon, only a single, "best" hypothesis will be generated, where
best means the lowest cost traversal through the lexical
network.

Anchors have been used in the system in two modes,
single-anchor and multiple-anchor. In the single-anchor mode,
anchors of different lengths are generated and the matcher is
invoked separately for each one, as shown in Figure 3(b). It should
be apparent that this procedure, although simple, is inefficient.
There are two reasons for this. First, the entire network is
applied to each anchor, thus time is wasted trying to force,
e.g., 5-syllable words into 1-syllable anchors. Second, the
same region of speech is scanned repeatedly, with the results
of one scan being unavailable to subsequent scans. The
multiple-anchor strategy alleviates these problems, at only a
slight increase in algorithm complexity, by using anchors with
multiple end regions. In this case, paths for words of inherently
different durations can terminate at compatible
points in the anchor and are not forced into inappropriate
regions. A multiple-anchor strategy reduces computation by
a factor of 3, while reducing the number of hypotheses
generated by 60% (inappropriate mappings of words into syllables
are eliminated). A third strategy is possible, though at
this time it has not been implemented. This is the use of
continuous anchors, where each inter-vocalic region serves
both as an entry point and an end-region for the search
(Figure 3(d)). The advantage of a continuous anchor strategy is
that it allows the simultaneous comparison of paths that span
different portions of the signal. The quality of input, however,
determines the success of this strategy.


4  Coarse  Labeller 

The coarse labelling algorithm is based on the ZAPDASH
algorithm [2], modified to generate additional labels and to
provide a more accurate segmentation of the signal. The
coarse labeller codes the speech signal using four parameters
extracted on a centisecond basis, these being peak-to-peak
amplitude and zero-crossing counts for low-passed and high-passed
portions of the signal (the crossover being at 1 kHz).
Segments are located by seeking frames characteristic of a
particular energy type using a strict criterion (an "anchor"),
then expanding these into a region using a laxer criterion. In
addition to the anchor-extend procedure, rules are used to
apply contextual information to ambiguous regions and to
perform boundary adjustment.

The algorithm currently distinguishes the following acoustic
events: silence, including "true" silence and noisy silence;
sonorants, including vocalic centers as well as inter-vocalic
sonorant energy dips (such as nasals or liquids); and a variety of
aperiodic signals, corresponding to fricatives, aspirates, etc.

The algorithm is robust and speaker-independent, and
operates reliably over a large dynamic range. Currently, the
quality of coarse labelling is such that less than 0.1% of syllabic
nuclei are missed. A number of extra nuclei are
generated, though this does not create difficulties for either
anchor generation or lattice cross-checking during matching.


5 Phonetic Lattice Integrator

The phonetic labels produced by the front-end [1] are
grouped into four separate lattices: vowels, fricatives,
closures, and stops. Moreover, labels both within and between
lattices may overlap in time. The role of the integrator
is to combine these separate streams and produce a single
lattice consisting of non-overlapping segments, each segment
containing the information from one or more segments in the
original lattices. The integrator maps the label space used by
the front-end into the label space used in the lexicon. For example,
the label "Stop" is expanded into the appropriate set of
lexical labels ([ptkbdg]). In addition, the integrator uses a
confusion matrix to partition the probability assigned to a
front-end label into several labels that it may be confused
with; thus an input iy label will be reflected in not only the
lexical iy label but also the ih label.
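The partitioning step can be sketched as a weighted redistribution of probability mass. The confusion weights below are invented for illustration and are not measurements from the system; the function name `spread` is likewise hypothetical.

```python
# Hypothetical confusion weights: each row sums to 1.
CONFUSION = {"iy": {"iy": 0.8, "ih": 0.2}}

def spread(label_probs, confusion=CONFUSION):
    """Partition each front-end label's probability over confusable lexical labels."""
    out = {}
    for label, p in label_probs.items():
        # Labels without a confusion row keep all of their probability.
        for lexical_label, weight in confusion.get(label, {label: 1.0}).items():
            out[lexical_label] = out.get(lexical_label, 0.0) + p * weight
    return out

lexical_probs = spread({"iy": 0.5, "s": 0.25})
```

The total probability mass is preserved; it is only redistributed, which is why more lexical states become reachable and the search grows.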


Figure 3


Notes: (a) A coarse segmentation of the speech signal. The
hatched blocks are vocalic centers. (b) Single anchors for the signal
in (a), a total of 20 anchors. Search can begin from any segment in
the onset region, must proceed through the middle, and can terminate
in the coda region. (c) Multiple anchors: search can begin
in the first region and end at any subsequent region. (d) Continuous
anchors: search can begin in any but the last region and
end at any but the first region.

The use of a confusion matrix to map the input symbol
produces an improvement in accuracy, but at the cost of additional
search. For the 708 word Shipping Management
task, first choice accuracy goes from 32% to 42%, while the
average number of states examined per word rises 2.5-fold,
from 958 to 2381. We believe that the advantage of this
transformation is due to the ability of the confusion matrix to
capture the broad behaviour of classifier labels across different
contexts and thereby supplement the probabilities
generated for a given classification region (see [1]).


6  Verifier 

Words are hypothesized in isolation, that is, without regard
to any sequential constraints between words. In this sense,
the system is completely bottom-up, since no syntactic,
semantic, or task constraints are brought to bear on the
process of hypothesization. The resulting word lattice consequently
contains many potential sequences of words. The
parser [4] attempts to construct plausible sequences, but does
not have the information necessary to decide whether a particular
sequence is phonetically acceptable. The Verifier examines
junctures between words and determines whether
these words can be connected together in a sequence. The
verifier deals with three classes of junctures: abutments,
where two words join together without overlap or intervening
segments; gaps, where the two words are separated; and
overlaps, where the words share one or more segments. In
general, overlaps that involve inconsistent interpretations of
the speech signal are disallowed, and gaps that contain significant
speech events are also disallowed. Figure 4 shows
the distribution of juncture types for the Email task
(considering only correct word sequences), together with
Verifier accuracy.
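The three juncture classes can be distinguished from segment indices alone, as the sketch below shows. It assumes inclusive integer segment indices for word boundaries and covers only the classification; deciding whether a given gap or overlap is acceptable requires inspecting the segments involved, which is not modelled here.

```python
def classify_juncture(left_end: int, right_begin: int) -> str:
    """Classify the juncture between two word hypotheses.

    `left_end` is the last segment of the left word and `right_begin`
    the first segment of the right word (inclusive indices, assumed).
    """
    if right_begin == left_end + 1:
        return "abutment"   # words join with no overlap or intervening segment
    if right_begin > left_end + 1:
        return "gap"        # one or more unclaimed segments between the words
    return "overlap"        # the words share one or more segments
```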


Figure 4: Juncture types and Verifier performance

Juncture type    Incidence    (rejection)
Abuts            51%          (0%)
Gaps             20%          (1.7%)
Overlaps         29%          (5.9%)

7  Performance 

System performance was evaluated by calculating the rank
of the correct word for a known anchor position. This metric
is somewhat conservative, since words with the same core but
with different endpoints are compared (for example, the embedded
word END competes with the word SEND under the current
scheme). Figure 5 gives performance for two types of input
data, spectrogram reading and automatic labelling (using
the September 1986 CMU system). The task is the 324 word
Electronic Mail task.


Figure 5: Word Matcher performance

              Spectrogram    Automatic
1st choice    60%            32%
Top 3         83%            55%
Top 10        93%            76%


8  Discussion 

The CMU lexical access system operates as a word-spotter,
generating all likely hypotheses, anchored on syllable nuclei.
The design of the matching algorithm demonstrates the appropriateness
of a unified matching strategy, as opposed to a
strategy that uses coarse-filtering of word candidates followed
by fine-grain phonetic matching: coarse-class constraints are
used as a component of the pruning strategy and do not entail
the use of hard decisions implicit in, e.g., a filter design.
This approach provides a maximum of flexibility to subsequent
levels of processing.

Experience with the anchor-based matcher has revealed a
number of shortcomings in its design. For example, the
benefits of anchoring are only realized when syllables are correctly
detected. Failure to identify a syllable boundary can be
catastrophic: one or more words may be lost as a result.
Similarly, the word-spotting mode in which the system
operates makes it difficult to make use of constraints that
could be imposed across word boundaries and, moreover,
complicates the process of interpreting juncture phenomena.
Given these findings, we have begun to explore a different approach
to word matching. The new algorithm is not based on
anchoring and it incorporates explicit modeling of juncture
phenomena. We refer to the new algorithm as a rolling
matcher, as it "rolls" through an utterance rather than jumping
from anchor to anchor.

In order to avoid the compromises made in the lattice integration
step, the system has changed to use input in the
form of a phone network. A phone network performs the
same function as the lattice integrator: it coordinates the output
of the four different acoustic-phonetic modules. Specifically,
it provides segment boundary alignment by coercing
segment endpoints, it resolves conflicting overlap conditions
(i.e., by providing alternate paths), and it ensures that all
regions of an utterance can be traversed (i.e., by labeling
regions not labeled by any of the primary modules according
to their coarse-class labels). Another benefit of a phone network
representation, from an acoustic-phonetic point of view,
is that it allows correct handling of sequential dependencies
(e.g., the influence of liquids on vowel color).

In contrast to the compilation process described earlier in
this paper, network compilation is now performed in two
separate passes. The first generates intra-word variations,
producing sub-nets for each baseform in the lexicon. After
these sub-nets are merged into a single net, a second set of
rules is applied to generate correct cross-word connections,
dealing with such varied phenomena as gemination, insertion
(e.g., of glides), and deletions (e.g., of closures).

The matching process consists of "rolling" the (lexical) network
through the phonetic network. Successful paths
through the lexical network (i.e., traversal from a given start
node to a given end node) produce a word hypothesis. The
word hypothesis is placed on the output lattice, and matching
continues on to all words that can legally follow the word that
was just completed.

Early analyses indicate that the Rolling matcher differs
from the Multiple-Anchor matcher in several respects. The
word lattice produced by the Rolling matcher is substantially
denser than the one produced by the multiple-anchor
matcher. This is because the latter produces a single best
match for a given region of speech, while the former produces
multiple matches, with different end-points. This property is
actually desirable, as it simplifies the juncture-validation
problem: multiple end-points allow the parser to select the
optimal version of a hypothesis, without the need for detailed
juncture analysis.


ABSTRACT


This paper compares the recognition accuracy obtained in forming
sentence hypotheses using several parsers based on different types
of weak statistical models of syntax and semantics. The inputs to
the parsers were word hypotheses generated from simulated
acoustic-phonetic labels. Grammatical constraints are expressed by
trigram models of sequences of lexical or semantic labels, or by a
finite-state network of the semantic labels. When the input to the
parser is of high quality, the more restrictive trigram models were
found to perform as well as or better than the finite-state language
model. The more restrictive trigram and network models of language
produce better recognition accuracy when all correct words are
actually hypothesized, but strong constraints can degrade
performance when many correct words are missing from the parser
input.


INTRODUCTION 


It is well known that the accuracy of automatic speech recognition
systems can be greatly improved by the imposition of syntactic,
semantic, and grammatical constraints. These constraints have
typically been expressed in the form of finite-state or phrase-structure
network models [1, 2, 3] and second order Markov models
(e.g. [4]). We would generally expect that more specific domain-dependent
constraints could provide a greater improvement of
recognition accuracy, but weaker grammatical constraints may prove
advantageous if the input to the sentence hypothesizer is noisy or if
extragrammatical utterances are frequently encountered.


Carnegie Mellon University is presently developing a large-vocabulary
speaker-independent speech recognition system. The
system includes a feature-based acoustic-phonetic hypothesizer [5],
an island-driven word hypothesizer [6], and several sentence parsers
that convert the outputs of the word hypothesizer into sentence
candidates.


We have explored several schemes for representing syntactic and
semantic knowledge in these parsers, including case frames [7] and
simple statistical models of sequences of syntactic and semantic
categories of the word candidates. Most of the statistical grammars
make use of a second-order Markov model to represent local
syntactic and semantic phenomena. Our work differs from most
other language models employing this "trigram" representation (e.g.
[4]) in that constraints are expressed in terms of probabilities of
sequences of lexical or semantic labels or "tags", rather than the
individual words in the vocabulary themselves. In addition to
providing reasonable accuracy, we also believe that this approach is
a promising way to reduce the amount of storage and training
required to effectively model word usage in tasks with very large
vocabularies. A small number of other groups have also proposed
trigram models using a reduced number of syntactic or semantic tags,
but these groups have not discussed the range of language models
and input conditions that will be considered here.


The purpose of this paper is to compare the ways in which the
degree of specificity of the grammatical constraints affects the
recognition accuracy obtained with a deterministic finite-state
network representation of the task and with some of the probabilistic
trigram grammars, considering inputs to the sentence parsers of
varying quality.


In the following sections we first describe the manipulations of the
input to the sentence parsers. We then briefly describe the different
parsers that are used in the present study. Finally, we compare the
recognition accuracy of these parsers in the presence of the different
types of degraded input and comment on some of the implications of
our results.


WORD  LATTICES  USED  IN  EXPERIMENTS 


For  each  sentence  presented  to  the  CMU  recognition  system,  the 
word  hypothesizer  outputs  a  large  number  of  candidate  words,  which 
are  each  characterized  by  a  begin  time,  an  end  time,  and  an 
acoustic-phonetic  plausibility  score.  This  set  of  annotated  word 
hypotheses is referred to as the "word lattice" of the input sentence.


The expected word accuracy of sentences produced by a parser is
closely related to two major attributes of the word lattice: (1) the
relative acoustic-phonetic scores of the correct words that are
present on the lattice, and (2) the percentage of correct words that
are missing from the lattice.


In order to compare the effects of degraded word quality and
omissions from the word lattice, we prepared six sets of lattices in
which the relative scores of correct words and percentages of
missing words were artificially manipulated. The characteristics of
these sets of lattices, which were used in our performance
calculations, may be summarized as follows:

•  Original lattices - 48 sentences from a 325-word electronic
mail (email) task were recorded by three female and two male
speakers. These sentences contained a total of 281 words. A
set of acoustic-phonetic labels was created manually by expert
spectrogram readers from spectrograms and other visual
displays of the digitized waveforms. This labelling was 'blind' in
that the labellers did not know the identity of the correct
utterance. Since these lattices nominally represent 'ideal'
output from the acoustic-phonetic module, they are useful for
evaluating degradations in recognition performance introduced
by the system's word and sentence hypothesizers. Word
lattices were generated from the acoustic-phonetic labels in the
fashion described in [6].

•  High-quality lattices - These lattices are the subset of the 48
original blind-labelled word lattices that have no correct words
missing and no incorrectly penalized word junctures (see
below). There are 31 sentences with 172 total words in these
lattices. The remaining sets of word lattices were obtained by
artificially degrading these word lattices.

•  Degraded-quality lattices - Moderately-degraded and
severely-degraded word lattices were created by adding a
constant to the acoustic-phonetic scores of words in the
high-quality lattices. This had the effect of worsening the
scores of the correct words relative to the scores of the
incorrect words.

•  Missing-word lattices - Missing-word lattices were created
by randomly deleting either 10 or 25 percent of the correct
words from the high-quality lattices.

The overall quality of these lattices is summarized in Figure 1. Each
curve of Figure 1 shows how many words in the lattice per correct
word need be examined to ensure that a given percentage of correct
words is included. For example, Figure 1 shows that the
high-quality lattices and the moderately and severely-degraded
lattices contain approximately 100 percent of the correct words (if
we are willing to consider a sufficiently large number of incorrect
words as well), while the lattices with missing words contain no
more than 70 and 85 percent of the correct words, no matter how many
words are examined. (The asymptotes in these three curves differ
slightly from their nominal values because of differences in the
word-boundary criteria used by the hand labellers and the automatic
lattice evaluation algorithms.)
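
The two kinds of artificial degradation described above can be illustrated with a minimal sketch. Everything named here (`WordHyp`, the `correct` flag, `degrade_scores`, `drop_correct_words`) is hypothetical, and two assumptions are made that the text does not state: that a larger score is better, and that the constant is applied only to the correct words.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class WordHyp:
    word: str
    begin: float   # begin time
    end: float     # end time
    score: float   # acoustic-phonetic score; ASSUMPTION: larger is better
    correct: bool  # hypothetical flag: hypothesis matches the spoken word


def degrade_scores(lattice, penalty):
    """Worsen the scores of correct words by a constant (assumed subtractive)."""
    return [replace(h, score=h.score - penalty) if h.correct else h
            for h in lattice]


def drop_correct_words(lattice, fraction, rng):
    """Randomly delete the given fraction of the correct hypotheses."""
    correct_idx = [i for i, h in enumerate(lattice) if h.correct]
    n_drop = round(fraction * len(correct_idx))
    dropped = set(rng.sample(correct_idx, n_drop))
    return [h for i, h in enumerate(lattice) if i not in dropped]
```

Passing a seeded `random.Random` instance keeps the deletion reproducible, which is convenient when the same degraded lattices must be presented to several parsers.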


PARSERS  USING  TRIGRAMS  AND  NETWORKS 

We compared the word accuracy of a number of different left-to-right
parsers in processing the various types of word lattices described
above. These parsers make use of the same architecture, differing
only in the types of knowledge used to evaluate candidate phrases.
We will therefore first describe the overall structure and then
describe each individual parser in terms of its use of syntactic and
semantic knowledge.

As noted above, each parser receives a lattice of words that contains
the begin time, end time, and a score for each word. It forms phrases
from the words in left-to-right fashion. New phrases are created by
attempting to add new words to the end of existing phrases. A beam
search is used to prune the set of phrases retained for further
expansion so that at any point in the parsing process only the 100
best-scoring phrases are retained.

The following types of knowledge are considered when adding a new
word to a candidate phrase:

•  Word score - The word score represents the likelihood for the
word based on acoustic-phonetic evidence, provided by the
word hypothesizer.

•  Word-juncture quality - The quality of the acoustic-phonetic
juncture between two words is scored by the junction verifier in
the word hypothesizer, based on tables of penalties for overlaps
and gaps.

•  Syntactic and semantic information - Two different methods
were used to score the syntactic or semantic plausibility: a
finite-state network derived from a formal description of the
grammar of the task, and trigrams of frequencies of syntactic
and/or semantic word classes.

Figure 1: Comparison of quality of word lattices used in the
sentence parsing experiments.

The score for a phrase is a linear combination of the scores provided
by each of the above knowledge sources. The weights used to
combine these scores were determined parametrically from training
data, and the recognition accuracy of the parsers is relatively
insensitive to their exact value.
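
The shared parser architecture (left-to-right phrase extension, a linearly combined score, and a beam of the best-scoring phrases) can be sketched as follows. This is only an illustration under stated assumptions, not the CMU implementation: `WordHyp`, `juncture_penalty`, `grammar_score`, the default weights and `max_gap` are all hypothetical, scores are assumed to be log-likelihood-like (larger is better), and the juncture tables are replaced by a simple per-frame penalty.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordHyp:
    word: str
    begin: int    # begin time in frames
    end: int      # end time in frames
    score: float  # ASSUMPTION: log-likelihood-like, larger is better


def juncture_penalty(prev_end, next_begin, per_frame=1.0):
    """Hypothetical stand-in for the tables of overlap and gap penalties."""
    return -per_frame * abs(next_begin - prev_end)


def beam_parse(lattice, utterance_end, grammar_score,
               weights=(1.0, 1.0, 1.0), beam_width=100, max_gap=2):
    w_word, w_junc, w_gram = weights
    active = [(0.0, 0, ())]   # (phrase score, end time, words so far)
    finished = []
    while active:
        candidates = []
        for score, end, words in active:
            for hyp in lattice:
                # A new word must start near the phrase end and advance time.
                if abs(hyp.begin - end) > max_gap or hyp.end <= end:
                    continue
                new_score = (score
                             + w_word * hyp.score
                             + w_junc * juncture_penalty(end, hyp.begin)
                             + w_gram * grammar_score(words, hyp.word))
                candidates.append((new_score, hyp.end, words + (hyp.word,)))
        # Beam pruning: keep only the best-scoring phrases.
        candidates.sort(key=lambda c: c[0], reverse=True)
        candidates = candidates[:beam_width]
        finished += [c for c in candidates
                     if abs(utterance_end - c[1]) <= max_gap]
        active = [c for c in candidates if c[1] < utterance_end - max_gap]
    return max(finished, key=lambda c: c[0]) if finished else None
```

In this sketch an allword parser corresponds to a `grammar_score` that always returns 0.0, while the trigram and network parsers would differ only in the function supplied.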

We  now  describe  the  various  parsers  in  more  detail. 

Allword parser. This parser makes use only of acoustic scores and
word juncture information in forming its phrase hypotheses, so any
word can follow any other word.

Trigram parsers. The trigram measure is derived from the
conditional probability of observing the syntactic or semantic classes
of three words in sequence in a set of training sentences, which in
turn is used to estimate the joint probability that the syntactic or
semantic structure of the sequence of three words is correct. The
overall utility of this approximation depends on the degree of domain
specificity of the training sentences and syntactic classes used.

For each set of syntactic and/or semantic constraints, words are
sorted into categories of one or more tags. Special tags are used to
represent the beginning and ending of a sentence. When a word is
added to the end of a phrase, it is assigned a trigram score based on
the conditional probability of observing its tag given the two previous
tags in the phrase.
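
A tag-trigram score of this kind can be estimated from relative frequencies, as in the minimal sketch below. The boundary tag names, the function names and the floor probability for unseen events are assumptions made for illustration; the text does not say how unseen tag sequences were handled, and the sketch assumes one tag per word.

```python
import math
from collections import Counter

BEGIN, END = "<s>", "</s>"  # hypothetical names for the special boundary tags


def train_tag_trigrams(tagged_sentences):
    """Count tag trigrams and their two-tag contexts in the training set."""
    tri, ctx = Counter(), Counter()
    for tags in tagged_sentences:
        padded = [BEGIN, BEGIN] + list(tags) + [END]
        for a, b, c in zip(padded, padded[1:], padded[2:]):
            tri[(a, b, c)] += 1
            ctx[(a, b)] += 1
    return tri, ctx


def trigram_log_prob(tri, ctx, prev2, prev1, tag, floor=1e-4):
    """log P(tag | prev2, prev1) from relative frequencies."""
    seen = tri[(prev2, prev1, tag)]
    if seen == 0:
        # ASSUMPTION: unseen events receive a small floor probability.
        return math.log(floor)
    return math.log(seen / ctx[(prev2, prev1)])
```

Because the counts are over tags rather than words, the tables stay small: with 41 tags there are at most 41^3 trigram entries regardless of vocabulary size.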

We examined the following trigram parsers, which are identified by
the types of syntactic and semantic knowledge that constrain their
hypotheses:

•  Syntactic trigram parser - In addition to word scores and word
juncture information, this parser also incorporates syntactic
information through trigrams of sequences of 41 tags denoting
lexical categories. These tags were a subset of the
approximately 90 lexical tags adopted by the compilers of the
Brown corpus [8]. They include expanded designations of parts
of speech, complete conjugations of some important verbs such
as be, do, and have, etc.

•  Augmented trigram parser - This parser is similar to the
syntactic parser, except that a set of 55 tags is used. This set is
somewhat more specific to the email task than the tags used by
the compilers of the Brown corpus. For example, different
designations are used for nouns representing people, places
and things, and there is a greater number of tags that designate
classes of prepositions. Hence these tags describe a modest
amount of semantic knowledge. We believe that these tags
could eventually represent the syntax of a database-query
system for an arbitrary task domain.

•  Semantic trigram parser - The semantic trigram parser is
similar to the syntactic parser, except that a set of 92 tags is
used that corresponds to the labels of the nodes of the
semantic network parser described below. These tags, and
their trigram probabilities, are much more domain dependent.

Semantic Network Parser. The deterministic semantic network
parser is derived from a description of the email task expressed in
the form of case frames and simple phrase structure rules [9, 7].
These were manually combined into a semantic grammar of about
350 rules. The grammar was then compiled into a finite-state
network similar to a Harpy network [1], for faster processing. This
type of network can provide a semantic interpretation of the input
utterance as well as mere word recognition. There were 92 different
categories of words in the network, reflecting the semantic specificity
of the encoding. The grammar encoding was tight in the sense that
only grammatical sentences are accepted.


EXPERIMENTAL RESULTS AND DISCUSSION


Each of the parsers was run on each of the sets of lattices. Results
are expressed by the percentage of correct words detected and by
the percentage of incorrect words inserted by the parser. (The
insertion percentage in this paper is defined to be the number of
incorrect words found divided by the number of words uttered. Note
that substitution errors cause both a decrease in the detection
percentage and an increase in the insertion percentage.)
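
The two measures can be written out as in the following minimal sketch. The function name and its arguments are hypothetical, and the sketch assumes the hypothesized words have already been classified as correct or incorrect, since the alignment procedure is not described here.

```python
def detection_and_insertion(n_uttered, n_correct_found, n_incorrect_found):
    """Return (detection %, insertion %), both relative to the words uttered."""
    detection = 100.0 * n_correct_found / n_uttered
    insertion = 100.0 * n_incorrect_found / n_uttered
    return detection, insertion
```

For a five-word sentence in which one word is replaced by a wrong one, four correct and one incorrect word are found, so a single substitution lowers detection to 80 percent and raises insertion to 20 percent at the same time.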


                          Allword   Semantic   Semantic
                          Parser    Trigram    Network
                                    Parser     Parser
DETECTION PERCENTAGES:
  Original lattices         56        87         83
  High-quality lattices     59        93         92
INSERTION PERCENTAGES:
  Original lattices         49         9         18
  High-quality lattices     45         3          7

Table 1: Comparison of word detection and insertion
percentages of three selected parsers, with input from the
original and high-quality lattices.


Table 1 compares the percentage of correct words and the
percentage of word insertions for three of the parsers using the
original and high-quality lattices. The two parsers that make use
of semantic knowledge perform significantly better than the allword
parser. While the tags for the semantic network parser and the
semantic trigram parser are identical, the semantic trigram parser
obtains slightly greater recognition accuracy because it evaluates the
likelihood of a sequence. The semantic network parser rejects illegal
sequences of words but performs no reordering of legal ones. These
results demonstrate that the parsers using trigrams with semantic
knowledge can equal or better the performance of parsers that
employ a finite-state grammar.

Effect  of  the  Rank  of  Correct  Words 

Figure 2 shows the effects of reducing the rank of correct words
when all correct words are in the lattice. As the quality of the lattices
worsens, all parsers produce fewer correct words in their best
hypotheses. The application of syntactic and semantic constraints
produces improved accuracy, and the more specific the constraints,
the greater the accuracy. The output of the allword parser is the
most severely affected as the lattice quality worsens.

Effect  of  Missing  Words 

Parser outputs for sets of lattices with missing words are shown in
Figure 3, and the results exhibit a different trend. When only 10
percent of the words are missing, the constrained parsers that use
syntax or semantics produce greater word accuracy than the allword
parser. This is because a significant number of sentences have no
missing words and the constraints are useful in parsing these
sentences. Since the average length of sentences in the email task
is five words, roughly 60 percent of the sentences have no missing
words when 10 percent of the correct words are missing from the
word lattice. When 25 percent of the correct words are missing, only
about 24 percent of the sentences should have no missing words. In
this case, the more specific tags produce poor performance. The
semantic trigram parser produces worse word accuracy than the
allword parser, while the use of the more general tags still provides
some benefit over the allword parser. The more specific the tags are,
the better they are able to differentiate between sequences of correct
and incorrect words. Parsers with more specific tags are more
disrupted by missing words, however, because there is less of a
chance that other (incorrect) words that are present could produce
an acceptable sequence of tags. Hence, the more general tags do
not provide as much accuracy when all words are present, but they
still may provide some benefit if many words are missing.
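
The two sentence-level percentages quoted above can be checked with a short calculation, under the simplifying assumptions (made here only for illustration) that every sentence contains exactly five words and that each correct word is deleted independently with probability d:

$$P(\text{no missing words}) = (1-d)^5, \qquad 0.90^5 \approx 0.59, \qquad 0.75^5 \approx 0.24.$$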


•  High-quality lattices
▲  Moderately-degraded lattices
■  Severely-degraded lattices

Figure 2: Effect of the rank of correct words on recognition
accuracy. Filled symbols indicate correct word
detection percentage; open symbols indicate
insertion percentage. Parsers examined (from left
to right) are the allword, syntactic trigram,
augmented syntactic trigram, semantic trigram,
and semantic network.


•  High-quality lattices
▲  10%-missing lattices
■  25%-missing lattices


Figure 3: Effect of the rank of correct words on presence of
missing words. Filled symbols indicate correct
word detection percentage; open symbols
indicate insertion percentage. Parser labels are
as in Figure 2.


Effect  of  Syntax  of  Training  Data 

We also performed an additional experiment to examine the
dependence of the parser that used the syntactic tags from the
Brown corpus on the syntax of its training data. This was
accomplished by estimating probabilities of the trigrams of the
syntactic trigram parser using the following three different sets of
sentences as the training text:

1. 171 examples of email sentences. (50 of these sentences were
used as the test set in all experiments.)

2. 171 sentences from the original Brown corpus. (These were
examples taken from articles in newspapers.)

3. 342 sentences obtained by combining the first two data sets.

The word accuracy of the syntactic trigram parsers trained on each
of the above sets of example sentences is shown in Table 2. For all
sets of lattices, parser performance increased as the training set
more closely resembled the email sentences that the parsers were
evaluated on.

It is not hard to imagine why the word accuracy was so low when the
parsers were trained on only the Brown corpus sentences. Almost
all of the Brown corpus sentences are declarative, and the first word
tends to be an article, adjective, or noun. The email sentences, on
the other hand, are all imperative or interrogative in form, and they
begin with a verb, verb auxiliary, or wh-type adverb. Since the
parsing proceeds in left-to-right fashion, the first word of a
sentence has a great effect on how the rest of the sentence is
parsed. In light of the profound differences between the syntactic
forms of sentences in the Brown corpus and in the email task, the
relatively good performance of the parser when trained on the
combination of the two databases is quite encouraging. We believe
this may indicate that reasonable performance may be obtained from
a completely domain-independent syntactic parser, provided that all
syntactic sentence forms are included in the training database.

                          Brown      Mixed      Email
                          Training   Training   Training
DETECTION PERCENTAGES:
  Original lattices         58         77         80
  High-quality lattices     61         77         83
INSERTION PERCENTAGES:
  Original lattices         46         27         22
  High-quality lattices     44         24         16

Table 2: Effect of the syntax of the training set of the
syntactic trigram parser on word detection and insertion
percentages.


SUMMARY 

We compared the word recognition accuracy obtained using several
different types of left-to-right sentence parsers. For the 325-word
email task, we found that parsers using trigram representations of a
small number of lexical or semantic tags could perform as well as or
better than the parser using a finite-state grammar. Increasing the
specificity of the trigram representation for a particular task domain
tended to improve performance when the correct words are not
among the very best word candidates, but it can degrade
performance if correct words are missing completely from the input
word lattices. The performance of the syntactic trigram parser
appeared to be relatively insensitive to the specific contents of its
training database, provided that the training set included the
sentence forms that were encountered in the test sentences.


Abstract -- Developing accurate and robust phonetic
models for the different speech sounds is a major challenge for
high performance continuous speech recognition. In this
paper, we introduce a new approach, called the stochastic
segment model, for modelling a variable-length phonetic
segment X, an L-long sequence of feature vectors. The
stochastic segment model consists of 1) time-warping the
variable-length segment X into a fixed-length segment called
a resampled segment, and 2) a joint density function of the
parameters of the resampled segment Y, which in this work is
assumed Gaussian. In this paper, we describe the stochastic
segment model, the recognition algorithm, and the iterative
training algorithm for estimating segment models from
continuous speech. For speaker-dependent continuous speech
recognition, the segment model reduces the word error rate by
one third over a hidden Markov phonetic model.

1. Introduction

In large vocabulary speech recognition, a word is
frequently modelled as a network of phonetic models. That is,
the word is modelled acoustically by concatenating phonetic
acoustic models according to a pronunciation network stored in
a dictionary of phonetic spellings. In phoneme-based speech
recognition systems, it is not necessary for the speaker to train
all words in the vocabulary; only the phonetic models are
trained. Assuming the above structure for a speech recognition
system, the goal of this work is to look for an improved
approach to phonetic modelling.

Hidden Markov modelling (HMM) is one method for
probabilistic modelling of the acoustic realization of a
phoneme. Although the HMM approach has been used
successfully [1, 2, 3], its recognition performance is not
sufficiently accurate for large vocabulary continuous speech
recognition. We propose an alternative and novel approach,
called a stochastic segment model, with the goal of improving
phonetic modelling. The motivation for looking at speech on a
segmental level, rather than on a frame-by-frame basis as in
HMM or dynamic time warping (DTW), is that we can better
capture the spectral/temporal relationship over the duration of a
phoneme. Evidence of the importance of spectral correlation
over the duration of a segment can be found in the success of
segment-based vocoding systems [4].

A speech "segment" is a variable-length sequence of
feature vectors, where the features might be, for example,
cepstral coefficients. The stochastic segment model is defined
on a fixed-length representation of the observed segment,
which is obtained by a time-warping (or resampling)
transformation. The stochastic segment model is a multivariate
Gaussian density function for the resampled representation of a
segment. The recognition algorithm chooses the phoneme
sequence that maximizes a match score on the resampled
segments. The training algorithm iterates between two steps:
first, the maximum probability phonetic segmentation of the
input speech is obtained, then maximum likelihood density
estimates of the segment models are derived.

The paper is organized as follows. Section 2 introduces
the segment model. Section 3 describes the segment-based
recognition algorithm, and Section 4 describes the training
algorithm. Section 5 presents experimental results for
phoneme and word recognition, comparing the results to HMM
recognition results for the same tasks. Finally, Section 6
contains a brief summary.

2.  Stochastic  Segment  Model 

In this section, we define the stochastic segment model
for an observed sequence of speech frames X = [x_1 x_2 ... x_L],
where x_i is a k-dimensional feature vector. We can think of
this observation as a variable-length realization of an
underlying fixed-length spectral trajectory Y = [y_1 y_2 ... y_m],
where the duration of X is variable due to variation in speaking
rate. Given X, we define the fixed-length representation
Y = X T_L, where the L x m matrix T_L, called the resampling
transformation, represents a time-warping. The segment Y,
called a resampled segment, is an m-long sequence of
k-dimensional vectors (or a k x m matrix). The stochastic
segment model for each phoneme α is based on the resampled
segment Y and is a conditional probability density function
p(Y|α). The density p(Y|α) is assumed to be multivariate
Gaussian, which is a km-dimensional model for the entire
fixed-length segment Y.

Resampling  Transformations 

The resampling transformation T_L is an L x m matrix used
to transform an L-length observed segment X into an m-length
resampled segment Y. We considered several different
variable- to fixed-length transformations, concentrating on
transformations which had previously been evaluated in the
segment vocoder [4]. The best recognition results are obtained
using linear time sampling without interpolation. Linear time
sampling involves choosing m uniformly spaced times at which
to sample the segment trajectory. Sampling without
interpolation refers to choosing the nearest observation in time
to the sample point, rather than interpolating to find a value at
the sample point.
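
Linear time sampling without interpolation can be sketched as below. The exact placement of the sample points and the handling of ties are not given in the text, so the sketch assumes that the m points are spread uniformly from the first to the last frame and that the nearest frame is chosen by rounding half up; the function names are hypothetical.

```python
def sample_indices(L, m):
    """0-based index of the frame chosen for each of the m sample points."""
    if m == 1:
        return [(L - 1) // 2]
    # ASSUMPTION: sample points run uniformly from the first to the last
    # frame; int(t + 0.5) selects the nearest frame (ties round up).
    return [int(j * (L - 1) / (m - 1) + 0.5) for j in range(m)]


def resampling_matrix(L, m):
    """L x m matrix T_L with a single 1 in each column, so that Y = X T_L."""
    idx = sample_indices(L, m)
    return [[1 if idx[j] == i else 0 for j in range(m)] for i in range(L)]


def resample(frames, m):
    """Map a list of L feature vectors to m vectors without interpolation."""
    return [frames[i] for i in sample_indices(len(frames), m)]
```

Under this convention a six-frame segment resampled to m = 4 keeps its first, third, fourth and sixth frames, and a segment shorter than m repeats some frames; the convention actually used by the authors may differ.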


Figure 1: Input segment (o) and corresponding
resampled segment (x). The two axes correspond to two
cepstral coefficients.


Figure 1 shows an input segment with duration six in two-
dimensional space and the corresponding resampled Y (with m
= 4) using linear time warping without interpolation. The
resampling transformation in this case is:

Probabilistic Model


As already mentioned, the segment model is a
multivariate Gaussian based on the resampled segment Y,
p(Y|α). Recall that resampled segments are km-dimensional,
where k is the number of spectral features per sample and m is
the number of samples. In this work, typically k = 14 and m = 10.
Consequently, the segment model has 140 dimensions.
Because of insufficient training data, we cannot estimate the full
phoneme-dependent covariance matrix, so we must make some
simplifying assumptions about the structure of the problem.
For the experiments reported here, we assume that the m
samples of the resampled segment are independent of each
other, which gives a block diagonal covariance structure for Y,
where each block in the segment covariance matrix
corresponds to the k x k covariance of a sample. The log of the
conditional probability of a segment Y given phoneme α can
then be expressed as

$$\ln[p(Y|\alpha)] = \sum_{j=1}^{m} \ln[p_j(y_j|\alpha)], \qquad (1)$$

where p_j(y_j|α) is a k-dimensional multivariate Gaussian model
for the j-th sample in the segment. The block-diagonal
structure saves a factor of m in storage and a factor of m^2 in
computation. The disadvantage of this approach is that the
assumption of independence is not valid, particularly if
resampling does not use interpolation, where adjacent samples
may be identical. In the future, with more training data, we
hope to relax this assumption. It is likely that more detailed
probabilistic models, such as Gaussian mixture models [5] and
context-dependent (conditional) models [2, 3], will yield better
recognition results than the simple Gaussian model; however,
due to larger training requirements we did not pursue these
models in this work.
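
Equation 1 can be evaluated as in the following minimal sketch, which sums one k-dimensional Gaussian log density per sample. The function names are hypothetical, and the use of a Cholesky factorization to evaluate each density is an implementation choice made here, not something stated in the text.

```python
import math


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][p] * low[j][p] for p in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def gaussian_log_pdf(y, mean, cov):
    """Log density of a k-dimensional Gaussian with full k x k covariance."""
    k = len(y)
    low = cholesky(cov)
    # Solve low * z = (y - mean) by forward substitution.
    z = []
    for i in range(k):
        r = y[i] - mean[i] - sum(low[i][p] * z[p] for p in range(i))
        z.append(r / low[i][i])
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(k))
    return -0.5 * (k * math.log(2.0 * math.pi) + log_det + sum(v * v for v in z))


def segment_log_likelihood(Y, means, covs):
    """Equation 1: sum over the m samples of the per-sample log densities."""
    return sum(gaussian_log_pdf(y, mu, c) for y, mu, c in zip(Y, means, covs))
```

Here `Y` is a list of m sample vectors, and `means` and `covs` hold the m per-sample means and k x k covariances of one phoneme model, so only m small matrices are stored and factored instead of one km x km matrix.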

Properties  of  the  Segment  Model 

There are several aspects of the stochastic segment model
which are useful properties for a speech recognition system.
First, the transformation T_L, which maps the variable-length
observation to a fixed-length segment, can be designed to
constrain the temporal structure of a phoneme model so that all
portions of the model are used in the recognition. We
conjecture that the fixed transformation will provide a better
model of phoneme temporal/spectral structure than either
HMM or DTW. Second, the segment model is a joint
representation of the phoneme, so the model can capture
correlation structure on a segmental level. In HMM, frames
are assumed independent given the state sequence. In the
segment model, no assumptions of independence are
necessary, though the model of Y given by Equation 1 is based
on the assumption of sample independence because of limited
training data in this study. The model is potentially more
general than the special case of (1). Lastly, by using a segment
model we can compute segment level features for phoneme
recognition. In other words, the segment model provides a
good structure for incorporating acoustic-phonetic features in a
statistical (rather than rule-based) recognition system. For
example, one might want to measure and incorporate formant
frequency or energy differences over a segment. Section 5
includes results where sample duration is used as a feature,
which can only be computed given the length of the entire
segment.


3.  Recognition  Algorithm 

In this section, we describe the recognition algorithm.
First, we consider the case when the input is
phonetically hand-segmented. Then, we generalize to
automatic recognition, that is, joint segmentation and
recognition of continuous speech.

When the segmentation of the input is known, we
consider a single segment X independently of neighboring
segments. The input segment X is resampled as segment Y.
The recognition algorithm is then to find the phoneme α that
maximizes p(Y|α):

$$\hat{\alpha} = \arg\max_{\alpha} \ln\left(p(Y|\alpha)\,p(\alpha)\right) \qquad (2)$$

where ln[p(Y|α)] is given by Equation 1. This decision rule is
equivalent to a maximum a-posteriori rule.

In an automatic recognition system, it is necessary to find
the segmentation as well as to recognize the phonemes. In this
case, we hypothesize all possible segmentations of the input,
and for each hypothesized segmentation s of the input into n
segments, we choose the sequence of phonemes α that
maximizes:

$$J(s) = \sum_{i=1}^{n} \left( L(i)\,\ln[p(Y_i|\alpha_i)\,p(\alpha_i)] + C \right) \qquad (3)$$

where L(i) is the duration of the i-th segment, Y_i is the
resampled segment corresponding to the i-th segment in s, and
α_i is the phoneme that maximizes p(Y_i|α)p(α). The cost C is
adjusted to control the segment rate. An efficient solution to
joint segmentation and recognition is implemented using a
dynamic programming algorithm. Note that for joint
segmentation and recognition, it is necessary to weight the
segment probability by the duration of the segment, so that
longer segments contribute proportionately higher scores to the
match score J(.) of the whole sequence.
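
A dynamic programming search that maximizes J(s) of Equation 3 can be sketched as follows. The sketch is illustrative only: `segment_score` is a hypothetical callable that returns, for frames b..e-1, the best value of ln[p(Y|α)p(α)] together with the phoneme achieving it, and `max_len` is an assumed bound on segment duration added to limit the search.

```python
def best_segmentation(T, segment_score, seg_cost, max_len):
    """Maximize J(s) = sum_i (L(i) * log_score_i + C) over segmentations of T frames."""
    neg_inf = float("-inf")
    best = [neg_inf] * (T + 1)   # best[e]: best score of any segmentation of frames 0..e-1
    back = [None] * (T + 1)      # back[e]: (start of last segment, its phoneme)
    best[0] = 0.0
    for e in range(1, T + 1):
        for b in range(max(0, e - max_len), e):
            if best[b] == neg_inf:
                continue
            log_p, phone = segment_score(b, e)
            # The segment log probability is weighted by its duration e - b.
            total = best[b] + (e - b) * log_p + seg_cost
            if total > best[e]:
                best[e] = total
                back[e] = (b, phone)
    # Trace back the segment boundaries and phoneme labels.
    segments = []
    e = T
    while e > 0:
        b, phone = back[e]
        segments.append((b, e, phone))
        e = b
    segments.reverse()
    return best[T], segments
```

With `seg_cost` negative, each additional segment is penalized, so the search favors fewer and longer segments; a positive value has the opposite effect.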

4.  Training  Algorithm 

In this section, we present the training algorithm for
estimating the segment models from continuous speech. We
assume that the phonetic transcription of the training data is
known and that we have an initial Gaussian model, p_0(Y|α), for
all phonemes. (Phonetic transcriptions can be generated
automatically from the word sequence that corresponds to the
speech by using a word pronunciation dictionary.) We assume
that the phonetic sequence α has length n. The algorithm
comprises two steps: automatic segmentation and parameter
estimation. The algorithm maximizes the log likelihood of the
optimal segmentation for the phonetic transcription, where the
log likelihood of a segmentation s is given by:

$$l(s) = \sum_{i=1}^{n} \ln[p(Y_i|\alpha_i)\,p(\alpha_i)] \qquad (4)$$

where Y_i is the resampled segment that corresponds to the i-th
segment in the segmentation s and α_i is the i-th phoneme in the
sequence α. With t = 0, the iterative algorithm is given by:

1. Find the segmentation s_t of the training data that
maximizes l(s_t) for the given transcription and the
current probability densities {p_t(Y|α)}.

2. Find the maximum likelihood estimate for the
densities {p_{t+1}(Y|α)} of all phonemes, using the
segmentation s_t.

3. t <- t + 1 and go to Step 1.

Both steps of the algorithm are guaranteed to increase l(s_t) with
t. If there are at least two different observations of every
phoneme, then the probability of the sequence is bounded.
Hence, the iterative training algorithm converges to a local
optimum. Step 1 is implemented as a dynamic programming
search whose complexity is linear with the number of phonetic
models N. Step 2 is the usual sample mean and sample
covariance maximum likelihood estimates for Gaussian
densities.
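
A minimal sketch of the two-step iteration follows. It rests on
several simplifying assumptions that are not from the paper:
one-dimensional frames, linear resampling, diagonal rather than full
covariance, and a variance floor (`var_floor`) added only to keep
the toy example numerically stable. The prior term p(α_i) is omitted
because it is constant for a fixed transcription. All function names
are hypothetical.

```python
import numpy as np

def resample(x, m):
    """Linearly resample a 1-D segment x to exactly m samples (assumed scheme)."""
    src = np.linspace(0.0, len(x) - 1.0, m)
    return np.interp(src, np.arange(len(x)), x)

def log_gauss(y, mean, var):
    """Log density of y under a diagonal Gaussian."""
    return float(-0.5 * np.sum(np.log(2 * np.pi * var) + (y - mean) ** 2 / var))

def align(x, phones, models, m):
    """Step 1: best segmentation of x into len(phones) consecutive segments."""
    T, n = len(x), len(phones)
    best = np.full((n + 1, T + 1), -np.inf)
    back = np.zeros((n + 1, T + 1), dtype=int)
    best[0, 0] = 0.0
    for i in range(1, n + 1):
        mu, var = models[phones[i - 1]]
        # leave at least one frame for each of the remaining phonemes
        for end in range(i, T - (n - i) + 1):
            for start in range(i - 1, end):
                s = best[i - 1, start] + log_gauss(resample(x[start:end], m), mu, var)
                if s > best[i, end]:
                    best[i, end], back[i, end] = s, start
    bounds = [T]
    for i in range(n, 0, -1):
        bounds.append(int(back[i, bounds[-1]]))
    bounds.reverse()
    return [(bounds[i], bounds[i + 1]) for i in range(n)], float(best[n, T])

def reestimate(x, phones, segs, m, var_floor=1e-2):
    """Step 2: sample mean and (floored) sample variance per phoneme."""
    pooled = {}
    for a, (s, e) in zip(phones, segs):
        pooled.setdefault(a, []).append(resample(x[s:e], m))
    return {a: (np.mean(Y, axis=0), np.maximum(np.var(Y, axis=0), var_floor))
            for a, Y in pooled.items()}

def train(x, phones, models, m, iters=5):
    """Alternate segmentation and parameter estimation for a fixed number of passes."""
    segs = None
    for _ in range(iters):
        segs, _ = align(x, phones, models, m)
        models = reestimate(x, phones, segs, m)
    return models, segs
```

A fixed number of passes is used here for brevity; a stopping test on
the change in log likelihood would be the more usual choice.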

5.  Experimental  Results 

In this section we will present results for a phoneme
recognition task, as well as word recognition results for a
segment-based recognition system and an HMM-based system.
All experiments use m = 10 samples per segment and k = 14
mel-frequency cepstral coefficients per sample. These values
are based on work in segment quantization [6], and limited
experimentation confirmed that these values represent a
reasonable compromise between complexity and performance.
Speech is sampled at 20 kHz, and analyzed every 10 ms with a
20 ms Hamming window.

Phoneme  Recognition 

The database used for phoneme recognition is
approximately five minutes of continuous speech from a single
speaker. The test set contains 270 phonemes. Both the test set
and the training set are hand-labelled and segmented, using a
61 symbol phonetic alphabet. In counting errors, an 'AX'
(schwa) recognized as 'IX' (fronted schwa) is considered
acceptably correct, as is an 'URT' (unreleased T) recognized as
a 'T'. All recognition rates presented represent "acceptably
correct" recognition rates. The acceptable recognition rate is
typically 6% to 3% higher than the strictly correct recognition
rate.

Phoneme recognition results for three different cases are
given in Table 1. The results illustrate a small degradation in
performance due to moving from recognition based on
manually segmented data to automatic recognition. Using
automatic training does not degrade performance any further.

We also experimented with adding a segmental feature to the
cepstral parameters: sample duration, which requires
knowledge of the hypothesized duration of the segment. Using
joint segmentation and recognition with hand-segmented
training data, performance improved from 74.4% to
75.9% as a result of using the duration feature.


Training        Test            %              %
Segmentation    Segmentation    Recognition    Insertion

Manual          Manual          78.5           0.0
Manual          Automatic       74.4           10.0
Automatic       Automatic       73.7           7.8

Table 1: Recognition results using manually segmented
speech and automatically segmented speech.


For reference, a discrete hidden Markov model with 3
states/phoneme and using a codebook with 256 entries has a 62%
phonetic recognition rate with 12% insertions. The HMM
recognition performance on this database is higher when
phoneme models are conditioned on left context: 75% correct
with 12% insertions [2]. In the latter case, 600 left-context
phonetic models are used in the HMM system while 61
phonetic models are used in the stochastic segment model.

Word  Recognition 

The segment-based word recognition system consists of a
dictionary of phoneme pronunciation networks and a collection
of segment phoneme models. A word model is built by
concatenating phoneme models according to the pronunciation
network. The recognition algorithm is simply a dynamic
programming search (Viterbi decoding) of all possible word
sequences. For the results in this paper, we assume that words
are independent and equally probable; there is no grammar
(statistical or deterministic) associated with recognition.
Within each word, we find the best phoneme segmentation for
that word, where the phoneme sequence is constrained by the
word pronunciation network.


For continuous speech word recognition, we used a 350
word vocabulary, speaker-dependent database based on an
electronic mail task. We present results for three different
male speakers. Fifteen minutes of speech was used for training
the 61 phoneme models for each speaker, from which the word
models were then built. An additional 30 sentences (187
words) are used for recognition. Analysis parameters are the
same as for the previous database. Again, "acceptable" error
rates are reported here, where in this case, homophones such as
"two" and "to" are considered acceptable errors. Since we do
not use a grammar, homophones are indistinguishable.


The initial segment models are obtained by training from
segmentations given by a discrete hidden Markov model
recognition system. The results after one pass of training of
the segment model for the three speakers are summarized in
Table 2. The HMM recognition results are also given for
comparison. For the HMM results, five passes of the forward-
backward training algorithm are performed. The segment
phoneme system outperforms the phoneme-based HMM
system, reducing the error rate by one third (including
insertions). However, the segment phoneme system does not
quite match the HMM context model system. This suggests
that context-dependent segment models might be useful. Note
that in the earlier phoneme results, the segment system
matched the performance of HMM models conditioned on left
context only. Here we give results for HMM models
conditioned on both left and right context. The HMM system
with context models conditioned on both left and right context
uses 2000 models, or thirty times the number used by the
segment system.


Speaker    Segment-PH    HMM-PH     HMM-PH-LF-RT

RS         87/5.3        85/10.2    90/1.1
FK         83/2.1        75/5.4     88/2.7
AW         78/3.7        68/7.5     86/3.7
Average    83/3.7        76/7.7     88/2.5

Table 2: Word recognition/insertion rates for three
speakers for the segment phoneme system and for two
HMM systems, phoneme models and phoneme models
conditioned on the left and right context.


6.  Conclusion 


To summarize, we feel that the segment model offers the
potential for large improvements in speaker-dependent acoustic
modelling of phonemes in continuous speech. Our initial
results demonstrate the potential of the approach. Of course, a
practical system requires automatic training and recognition,
which we demonstrated to perform close to the hand-
segmented case at the cost of a few insertions. For
comparison, the automatic segment system reduces the word
error rate by one third over an HMM system on a 350-word
continuous speech recognition task.


Abstract

This paper deals with rapid speaker adaptation for speech
recognition. We introduce a new algorithm that transforms
hidden Markov models of speech derived from one "prototype"
speaker so that they model the speech of a new speaker. The
speaker normalization is accomplished by a probabilistic
spectral mapping from one speaker to another. For a 350 word
task with a grammar and using only 15 seconds of speech for
normalization, the recognition accuracy is 97% averaged over
6 speakers. This accuracy would normally require over 5
minutes of speaker-dependent training. We derive the
probabilistic spectral transformation of HMMs, describe an
algorithm to estimate the transformation, and present
recognition results.

1. Introduction

We have previously demonstrated our techniques for
robust modeling of phonetic coarticulation for large
vocabulary, continuous speech recognition [1]. The technique
combines detailed context-dependent phonetic hidden Markov
models (HMM) with robust context-independent models to
improve word recognition accuracy. The BBN Speech
Recognition System (BYBLOS) integrates many components
to allow accurate speech recognition with a grammar [2]. On a
350-word continuous speech recognition task, the word
recognition accuracy was 90% with no grammar, and
98%-99% with a grammar. To achieve this high recognition
accuracy, each speaker read 300 training sentences or about 15
minutes of training speech.

Some speech recognition applications have a need for a
new speaker to begin using the system with reasonable
accuracy without investing a long time to train the system on
their voice. However, as we will see in Section 4, the speaker-
dependent performance degrades dramatically when the
amount of training speech is reduced using the standard
training procedure.

The approach that we consider in this paper is to
normalize well-trained models from a "prototype" speaker, to
model the speech of the new speaker. The normalization
requires only a few sentences (referred to as "normalization
speech") from the new speaker.

In Section 2, we derive and present a procedure for
estimating a probabilistic spectral mapping from one speaker to
another. Experiments to test these procedures are described in
Section 3. The results of the experiments are analyzed in
Section 4.

2.  Probabilistic  Mapping 

In this section we present the basis for the probabilistic
transformation and show it to be equivalent to an expanded
HMM model for each state of the original HMM. The
transformation is generalized to be partially dependent on the
particular phoneme. Finally, we present two detailed
algorithms for estimating the probabilistic mapping.

Discrete  Hidden  Markov  Models 

For each state of a discrete HMM, we have a discrete
probability density function (pdf) defined over a fixed set of N
spectral templates. For example, in the BYBLOS system we
typically use a vector quantization (VQ) codebook of size
N = 256 [3]. The index of the closest template is referred to
below as the "quantized spectrum". We can view the discrete
pdf for each state s as a probability row vector

    p(s) = [\, p(k_1 \mid s),\ p(k_2 \mid s),\ \ldots,\ p(k_N \mid s) \,]    (1)

where p(k_i|s) is the probability of spectral template k_i at state s.

Mapping From Prototype to New Speaker

If we define a quantized spectrum for the prototype
speaker as k_i, 1 ≤ i ≤ N, where i is the index of the spectral
template, and a quantized spectrum for the new speaker as
k'_j, 1 ≤ j ≤ N, then we denote the probability that the new
speaker will produce quantized spectrum k'_j, given that the
prototype speaker produced spectrum k_i, as p(k'_j|k_i) for all i, j.

We can rewrite the probability for spectrum k'_j given a
particular state s of the HMM as

    p(k'_j \mid s) = \sum_{i=1}^{N} p(k_i \mid s)\, p(k'_j \mid k_i, s)    (2)

If we assume that the probability of k' given k is
independent of s, then

    p(k'_j \mid s) = \sum_{i=1}^{N} p(k_i \mid s)\, p(k'_j \mid k_i)    (3)

The set of probabilities p(k'_j|k_i) for all i and j form an
N x N matrix, T, that can be thought of as a probabilistic
transformation from one speaker's spectral space to another's.
We can compute the discrete pdf, p'(s), at state s for the new
speaker as the product of the row vector p(s) and the matrix
T:

    p'(s) = p(s)\, T; \qquad T_{ij} = p(k'_j \mid k_i)    (4)
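
As a small numerical illustration of (4), the sketch below applies a
row-stochastic matrix to a stack of prototype pdfs. The function name
`transform_pdfs` and the toy 3-entry codebook are assumptions made
for the example; the paper's codebook has N = 256.

```python
import numpy as np

def transform_pdfs(P, T):
    """Map prototype pdfs to the new speaker: row s of P is p(s), result is p(s) T."""
    P = np.asarray(P, dtype=float)
    T = np.asarray(T, dtype=float)
    # each row of T is p(k'_j | k_i) over j, so it must sum to 1
    assert np.allclose(T.sum(axis=1), 1.0), "rows of T must sum to 1"
    return P @ T

# two states, toy codebook of size 3
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.1, 0.8]])
T = np.array([[0.8, 0.2, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.3, 0.7]])
P_new = transform_pdfs(P, T)
```

Because every row of T sums to 1, each transformed row is again a
valid pdf, and an identity T leaves the prototype pdfs unchanged.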


Expanded HMM Formulation

The probabilistic transformation can also be described in
terms of an expanded HMM model for the state. Figure 1a
shows a single state of the HMM for a new speaker. It
contains a single discrete probability vector, p'(s). Figure 1b
shows an expanded model in which the single state is replaced
by N parallel paths.

Figure 1: Expanded HMM. a) Single state of the
HMM; b) expanded model separating prototype pdf and
transformation matrix.


The transition probability for path i is p(k_i|s), the probability of
the quantized spectrum k_i given the same state s for the
prototype speaker. The discrete pdf on that path is p(k'|k_i),
which corresponds to row i of the transformation matrix.

Careful inspection of the figure will reveal that the
probability of any new speaker spectrum k'_j for the expanded
HMM shown is a summation of the jth probability over all N
paths, as given in (3). Therefore, Figure 1a represents the left
side of Equation 4, while Figure 1b represents the right side.
Now that we have decomposed each pdf for the new speaker
into its components, we can use the forward-backward
algorithm to estimate the transformation matrix while keeping
the prototype pdf fixed. Then, once the matrix has been
determined, we can replace the expanded HMM by the single
pdf resulting from the vector-matrix multiplication in (4).

Phoneme-Dependent  Transformation 

The independence assumption in (3) above assumes that a
single (probabilistic) spectral mapping will transform the
speech of one speaker to that of another. However, we know
that some of the differences between speakers cannot be
modeled this simply. We can define a phoneme-dependent
mapping

    p(k'_j \mid s) = \sum_{i=1}^{N} p(k_i \mid s)\, p(k'_j \mid k_i, \phi(s))    (5)

where φ(s) specifies the equivalence class of states in models
that represent the same phoneme as s. Since the amount of
training speech from the new speaker will be small, we could
not hope to have enough samples of each phoneme to estimate
a reliable mapping for all phonemes. Therefore, we interpolate
the phoneme-dependent transformation matrix with the
phoneme-independent transformation matrix. The weight for
the combination depends on the number of observed frames of
the particular phoneme. Thus for those phonemes that occur
several times in the normalization speech, the transformation
will depend mostly on that particular phoneme.
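
The text does not give the weighting formula, so the following is
only one plausible sketch of such an interpolation. The weight
n / (n + tau), the constant `tau`, and the function name
`interpolate_transforms` are all assumptions made for illustration,
not the paper's method.

```python
import numpy as np

def interpolate_transforms(T_phone, T_global, n_frames, tau=100.0):
    """Blend a phoneme-dependent matrix with the phoneme-independent one.

    The weight grows with the number of observed frames of the phoneme;
    the specific form n / (n + tau) is an assumed choice.
    """
    lam = n_frames / (n_frames + tau)
    return lam * np.asarray(T_phone) + (1.0 - lam) * np.asarray(T_global)

T_global = np.array([[0.9, 0.1],
                     [0.2, 0.8]])
T_phone = np.array([[0.5, 0.5],
                    [0.0, 1.0]])
T_rare = interpolate_transforms(T_phone, T_global, n_frames=0)
T_frequent = interpolate_transforms(T_phone, T_global, n_frames=10000)
```

Any convex combination of two row-stochastic matrices is itself
row-stochastic, so the interpolated matrix can be used directly in (4).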

Detailed  Algorithm 

The algorithm begins with a VQ codebook and well-
trained context-dependent and context-independent pdfs
derived from a prototype speaker. A small number of
sentences are read by the new speaker. The new
(normalization) speech is quantized using the prototype
speaker's VQ codebook. (This step may be a source of
reduced performance, and will be discussed further in Section
4.) Then, we use a modification of the standard forward-
backward algorithm to estimate the phoneme-dependent and
phoneme-independent transformation matrices.

To save computation and storage we use p'(s), the
compact HMM in Figure 1a, to compute the partial (α and β)
terms in the forward-backward algorithm. The forward-
backward "counts" are added to a separate count matrix. (Two
methods for computing the counts are defined at the end of this
subsection.) Since we have no a priori transformation matrix,
we must provide an initial estimate. To minimize computation
we use an identity matrix for the first transformation (that is,
we just use the prototype pdf as is). However, when we
compute the counts in the first pass, the transformation matrix
is a constant value of 1/N. After the first pass, the same matrix
is used both for forward-backward partial terms and for
computing the counts. At the end of each pass through the
normalization data, each row of the count matrix, which
corresponds to p(k'|k_i), the transformation given one prototype
spectrum, is rescaled so it sums to 1. This normalized count
matrix then becomes the new probabilistic transformation
matrix. After the final pass we transform all the prototype
models using (4).

Computing Counts - Method 1

For each alignment of a state with an observed quantized
spectrum, k'(t) = k'_j, the prototype pdf vector p(s) is multiplied
by column j of the transformation matrix, p(k'_j|k_i), 1 ≤ i ≤ N.
This vector product is multiplied by the constants α_{t-1}(s-1)
and β_t(s) (shown in Figure 1b) and then accumulated in column
j of the count matrix. α_{t-1}(s-1) is the probability of the
observed spectra from frames 1 through t-1 given the models
up to but not including state s. β_t(s) is the probability of the
observed spectra from the end of the sentence back to time t+1
given the models after state s. This method corresponds to the
standard (maximum likelihood) forward-backward algorithm
for the HMM shown in Figure 1b.

Computing Counts - Method 2

Method 2 is similar to Method 1, with the exception that
the prototype pdf vector is multiplied by the constants α_t(s) and
β_t(s) (shown in Figure 1b) and then added to the corresponding
column of the count matrix. That is, the counts are computed as
the probability of being in state s at time t, times the prototype
pdf. We found that only one pass of the algorithm is necessary
for Method 2, making it preferable in terms of computation.
We also found that this method results in slightly better
performance than Method 1. Therefore all results quoted
below are for Method 2.
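
The Method 2 accumulation and the row rescaling can be sketched as
below. The sketch assumes the state occupancy probabilities
`gamma[t, s]` have already been produced by a forward-backward pass
and are passed in as an array; the function names and the uniform
fallback for rows that received no counts are assumptions for the
example.

```python
import numpy as np

def method2_counts(gamma, P, obs, N):
    """Accumulate counts[i, j]: occupancy-weighted prototype pdf, filed under
    the new speaker's observed code j at each frame."""
    counts = np.zeros((N, N))
    for t, j in enumerate(obs):
        # sum over states s of gamma[t, s] * p(k_i | s), for every i
        counts[:, j] += gamma[t] @ P
    return counts

def normalize_rows(counts, floor=1e-12):
    """Rescale each row to sum to 1; rows with no counts become uniform (assumed)."""
    row_sums = counts.sum(axis=1, keepdims=True)
    uniform = 1.0 / counts.shape[1]
    return np.where(row_sums > floor, counts / np.maximum(row_sums, floor), uniform)

# toy case: two states, codebook of size 2, new speaker swaps the two codes
P = np.array([[1.0, 0.0],      # state 0 emits code 0 for the prototype
              [0.0, 1.0]])     # state 1 emits code 1 for the prototype
gamma = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
obs = [1, 1, 0, 0]             # codes observed from the new speaker
T = normalize_rows(method2_counts(gamma, P, obs, N=2))
```

In this toy case the estimated matrix swaps the two codes, so `P @ T`
assigns each state the code the new speaker actually produced.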


3. Experiments


Database

We have performed experiments on a 350-word subset of
a naval database retrieval task (FCCBMP). The task has a
fairly rich structure and allows many different types of
questions and commands. The prototype speaker recorded 400
sentences in 4 sessions of 100 sentences each, separated by a
few days. The first three sessions were designated as training
data, and the last as test material. At an average of 3 seconds
per sentence, the total duration of the training material was thus
about 15 minutes for the prototype speaker.

Each of 6 new speakers then recorded a subset of the
training sentences and, in a separate session, the 100 test
sentences. The 6 speakers included one female, one non-native
speaker, one experienced speaker, and three inexperienced
speakers.

We constructed a dictionary of phonetic pronunciations
for the vocabulary without listening to either the training or test
material. With very few exceptions, only one pronunciation
was chosen for each word.

The sentences were read directly into a close-talking
microphone in a natural but deliberate style in a quiet office
environment. The speech was lowpass filtered at 10 kHz and
sampled at 20 kHz. Fourteen Mel-frequency cepstral
coefficients (MFCC) were computed every 10 ms on a 20 ms
analysis window. One half of the training speech of the
prototype speaker was used to derive a speaker-dependent VQ
codebook. Then all the recorded speech for all speakers was
quantized using this codebook.

Training

The 15 minutes of speech from the prototype speaker was
used, together with the phonetic dictionary, to estimate
context-dependent and context-independent phonetic models.
The speech models for the new test speakers were computed in
two ways: Speaker-Dependent training and Speaker
Normalization. In addition to these two models for the new
speaker, we also performed control experiments using the
prototype speaker's models without any change. These
unaltered models are designated "Cross-Speaker" models.
Prior to recognition, the phonetic models were combined and
concatenated into word models to facilitate the word
recognition process.

Recognition

We used the time-synchronous search procedure
described in [4] to find the most likely sequence of words for
each test sentence. Recognition experiments were performed
both with and without a grammar. When no grammar was
used, the effective branching factor was equal to the
vocabulary size (350). The grammar used had a Maximum
Perplexity [5] of 30 and an estimated Perplexity [6] of 20
(measured on a test set). The recognized sequence of words
was then compared automatically to the correct answer to
determine the percentage of errors of each type: substitutions,
deletions and insertions.


4.  Results 


We use an error measure that reflects all three types of
errors in a single number. The percent error is given by

    \%\,\text{error} = 100 \times \frac{\text{substitutions} + \text{deletions} + \text{insertions}}{\text{total words} + \text{insertions}}

The word accuracy is then defined as 100 - %error. Note that
this definition is different from the percent correct words.
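
The measure is easy to state in code. In this sketch the function
names and the example counts are illustrative only; they are not
taken from the experiments.

```python
def percent_error(substitutions, deletions, insertions, total_words):
    """Percent error as defined above: all error types over words plus insertions."""
    return 100.0 * (substitutions + deletions + insertions) / (total_words + insertions)

def word_accuracy(substitutions, deletions, insertions, total_words):
    """Word accuracy, defined as 100 minus the percent error."""
    return 100.0 - percent_error(substitutions, deletions, insertions, total_words)

# illustrative counts: 100 reference words, 2 substitutions, 1 deletion, 1 insertion
err = percent_error(2, 1, 1, 100)
acc = word_accuracy(2, 1, 1, 100)
```

With these counts 97 of the 100 words are correct, yet the error is
100 x 4 / 101, just under 4%, which shows how the measure differs
from the percent of correct words.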

Figure 2 below shows the recognition error as a function
of the amount of training speech (on a log scale) for both
training conditions. For reference, the results using the Cross-
Speaker models are also shown. Some of the conditions that
did not seem to warrant extensive testing (e.g., 15 second
speaker-dependent training) were evaluated using a subset of
the speakers. More critical results (e.g., 15 second speaker
normalization) were evaluated using all 6 speakers.


The recognition error varied less with the duration of
speech for speaker normalization than for speaker-dependent
training - particularly when a grammar was used. The error
rate with 15 seconds of normalization speech was about the
same as achieved by the speaker-dependent training method
with 6 to 10 minutes of training speech. In particular, when a
grammar was used, the word recognition error with only 15
seconds of normalization speech from each speaker was 4%
(97% correct words with 1% insertions).

Figure 2: Speaker-Dependent Training vs Normalization.
Speaker-Dependent Training (O); Speaker Normalization
(□); Cross-Speaker Results (A). The solid line indicates
accuracy with a grammar; the dashed line indicates no
grammar.

Detail vs Robustness

We can see from the results with and without a grammar
that the speaker transformation seems to be much more
successful when a grammar is used. That is, the error
decreased by a bigger factor (from speaker-dependent training
to the normalization algorithm) when a grammar was used than
when no grammar was used.

When  no  grammar  is  used  in  speech  recognition  it  is 
important  that  the  models  be  sharply  tuned  to  make  fine 
distinctions. Occasional errors will result from a finely tuned
model  that  was  inadequately  trained.  In  contrast,  we  assume 
that  when  a  grammar  is  used  the  number  of  words  allowed  at 
each  point  is  small  relative  to  the  vocabulary  size.  In  this  case 
it  is  less  likely  that  fine  phonetic  distinctions  will  be  necessary. 
To  get  very  high  performance,  it  becomes  more  important  that 
the  correct  word  never  get  a  very  low  score. 

We  have  observed  that  the  pdfs  resulting  from  the  speaker 
normalization  procedure  are  typically  broader  than  those 
resulting  from  speaker-dependent  training.  We  surmise  that 
this  effect,  combined  with  the  appropriate  spectral  mapping 
between  the  speakers,  accounts  for  the  large  improvement  in 
accuracy  when  a  grammar  is  used. 

Source  of  Errors 

We  performed  a  series  of  experiments  on  one  speaker  in 
an  effort  to  determine  whether  the  major  source  of  errors  is  the 
duration of normalization speech, the normalization
procedure  itself,  or  the  fact  that  the  VQ  codebook  of  the 
prototype  speaker  is  used  for  the  new  speaker.  We  present  the 
recognition  results  (using  no  grammar)  in  Table  1  below. 


Condition                         % error

15 min spkr-dependent training    16%
Prototype VQ codebook             24%
15 min normalization              27%
5 min normalization               30%
2 min normalization               33%

Table 1: Source of Recognition Errors.
Each line changes one experimental condition.


As  we  see  in  the  table,  the  largest  increase  in  word  error  is 
the  result  of  using  a  VQ  codebook  that  was  not  designed  for 
the  new  speaker.  Our  next  step,  therefore,  will  be  to  derive  a 
codebook  for  the  new  speaker  from  a  combination  of  the  new 
speech  and  the  prototype  speaker's  codebook.  This  expanded 
codebook  will  form  the  basis  for  the  normalized  pdf  models. 

5.  Summary 

We have presented a method for transforming the discrete
HMM models of one speaker so that they are appropriate for a
second speaker. The procedure uses a small amount of speech
to estimate a probabilistic spectral mapping from a well-trained
prototype speaker to a new speaker. The recognition accuracy
with 15 seconds of normalization speech and a grammar (tested
on a set of 6 diverse speakers) was 97% with 1% word
insertions.

The  method  also  makes  the  HMM  models  more  robust, 
which  is  most  appropriate  when  a  grammar  is  used.  There  is 
some  evidence  that  the  speaker  normalization  performance 
suffers  because  we  use  the  prototype  speaker's  VQ  codebook 
for  the  new  speaker.  In  future  work  we  will  investigate 
speaker-adaptive  VQ  codebooks  for  speaker  normalization. 
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designing  a  large  and  complex  system  for  continuous  speech 
recognition.  This  paper  is  organized  as  follows.  Section  2 


Abstract 

In  this  paper,  we  describe  BYBLOS,  the  BBN  continuous 
speech  recognition  system.  The  system,  designed  for  targe 
vocabulary  applications,  integrates  acoustic,  phonetic,  lexical, 
and  linguistic  knowledge  sources  to  actveve  high  recognition 
performance.  The  basic  approach,  ar  described  in  previous 
papers  [1, 2|,  makes  extensive  use  of  robust  context-dependent 
models  of  phonetic  coarticulation  using  Hidden  Markov 
Models  (HMM).  We  describe  the  components  of  the  BYBLOS 
system,  including  signal  processing  frontend,  dictionary, 
phonetic  model  training  system,  word  model  generator, 
grammar  and  decoder.  In  recognition  experiments,  we 
demonstrate  consistently  high  word  recognition  performance 
on  continuous  speech  across:  speakers,  task  domains,  and 
grammars  of  varying  complexity.  In  speaker-dependent  mode, 
where  15  minutes  of  speech  is  required  for  training  to  a 
speaker,  98.5%  word  accuracy  has  been  achieved  in  continuous 
speech  for  a  350-word  task,  using  grammars  with  perplexity 
ranging  from  30  to  60.  With  only  15  seconds  of  training 
speech  we  demonstrate  performance  of  97%  using  a  grammar. 

1.  Introduction 

Speech  is  a  natural  and  convenient  form  of 
communication  between  man  and  machine.  The  speech  signal, 
however,  is  inherendy  variable  and  highly  encoded  Vast 
differences  occur  in  the  realizations  of  speech  units  related  to 
context,  style  of  speech,  dialect,  talker  Tms  makes  the  task  of 
large  vocabulary  continuous  speech  recognition  (CSR)  by 
machine  a  very  difficult  one.  Fortunately,  speech  is  also 
structured  and  redundant:  information  about  the  linguistic 
content  in  the  speech  signal  is  often  present  at  the  various 
linguistic  levels.  To  achieve  acceptable  performance,  the 
recognition  system  must  be  able  to  exploit  the  redundancy 
inherent  in  the  speech  signal  by  bringing  multiple  sources  of 
knowledge  to  hear  In  general,  these  can  include:  acoustic- 
phonetic,  phonological,  lexical,  syntactic,  semantic  and 
pragmatic  knowledge  sources  (KS).  In  addition  to  designing 
representations  for  these  KSs.  methodologies  must  be 
developed  for  interfacing  them  and  combining  them  into  a 
uniform  structure  An  effective  and  coherent  search  strategy 
ran  then  be  applied  based  on  global  .sion  criteria.  Practical 
issues  that  need  to  be  resolved  include  compuiaiion  and 
memory  requirements,  and  hc-v  could  be  traded  off  to 
obtain  the  desired  combination  of  speed  and  performance. 


gives  an  overview  of  the  BYBLOS  system.  Section  3 
describes  our  signal  processing  frontend.  Section  4  describes 
the  trainer  system  used  for  phonetic  model  knowledge 
acquisition.  Section  5  describes  the  word  model  generator 
module  that  compiles  word  HMMs  for  each  lexical  item. 
Section  6  describes  the  syntactic/grammatical  knowledge 
source  that  operates  on  a  set  of  context-free  rules  describing 
the  task  domain  to  produce  an  equivalent  finite  state  automaton 
used  in  the  recognizer.  Section  7  describes  the  BYBLOS 
recognition  decoder  using  combined  multiple  sources  of 
knowledge.  Finally,  Section  8  presents  some  figures  and 
discussions  on  BYBLOS  recognition  performance 

2.  Byblos  System  Overview 

Figure 1 is a block diagram of the BYBLOS continuous
speech recognition system. We show the different modules
and knowledge sources that comprise the complete system, the
arrows indicating the flow of module/KS interactions. The
modules are represented by rectangular boxes. They are,
starting from the top: Trainer, Word Model Generator, and
Decoder. Also shown are the knowledge sources, which are
represented by the ellipses. They include: Acoustic-Phonetic,
Lexical, and Grammatical knowledge sources. We will describe
briefly the various modules and how they interact with the
various KSs.

Acoustic-Phonetic  KS 

The  Trainer  module  is  used  for  the  acquisition  of  the 
acoustic-phonetic  knowledge  source.  It  takes  as  input  a 
dictionary  and  training  speech  and  text,  and  produces  a 
database  of  context-dependent  HMMs  of  phonemes. 

Lexical  KS 

The Word Model Generator module takes as input the
phonetic models database, and compiles word models from
phonetic models. It uses the dictionary (the lexical KS), in which
phonological rules of English are used to represent each lexical
item in terms of its most likely phonetic spellings. The
lexical KS imposes phonotactic constraints by allowing only
legal sequences of phonemes to be hypothesized in the
recognizer, which reduces the search space and improves
performance. The output of the Word Model Generator is a
database of word models used in the recognizer.


In BYBLOS, we have explored many issues that arise in


Grammatical  KS 

More recently, we have been working on representation
and integration of higher levels of knowledge sources into
BYBLOS, including both syntactic and semantic KSs. By
incorporating both of these KSs into our recognizer in the
form of a grammar, we demonstrate improved
recognition performance. In Section 6, we describe the
Grammatical KS in more detail.


Figure 1: BYBLOS System Diagram.


3.  Signal  Processing  and  Analysis  Component 

The BYBLOS signal processing frontend performs
feature extraction for the acoustic models used in recognition.
Sentences are read directly into a close-talking microphone in a
natural but deliberate style in a normal office environment.
The input speech is lowpass filtered at 10 kHz and sampled at
20 kHz. Fourteen Mel-frequency cepstral coefficients (MFCC)
are computed from short-term spectra every 10 ms using a 20
ms analysis window. This MFCC feature vector is then vector
quantized to an 8-bit (256 bins) representation. The vector
quantization (VQ) codebook is computed using the k-means
clustering algorithm with about 5 minutes of speech. We
perform a variable-frame-rate (VFR) compression in which
strings of up to 3 identical vector codes are compressed to a
single observation code. We found this VFR procedure speeds
up computation with no loss in performance.
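The VFR step can be illustrated with a minimal sketch. It assumes that a run longer than three identical codes is split into several observations of at most three frames each; the text does not say how longer runs are handled, and the function name and the `max_run` parameter are hypothetical.

```python
def vfr_compress(codes, max_run=3):
    """Collapse runs of identical VQ codes.

    Each output code stands for between 1 and `max_run` consecutive
    identical input codes (assumption: longer runs are split).
    """
    compressed = []
    run_length = 0  # number of input frames absorbed by the last output code
    for code in codes:
        if compressed and code == compressed[-1] and run_length < max_run:
            run_length += 1          # absorb this frame into the current observation
        else:
            compressed.append(code)  # start a new observation
            run_length = 1
    return compressed
```

For example, under this assumption a run of four identical codes yields two observations rather than one.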


4. Training/Acquisition Of Phonetic
Coarticulation Models

The training system in BYBLOS acquires and estimates
the phonetic coarticulation models used in recognition. Given
that we model speech parameters as probabilistic functions of a
hidden Markov chain, we make use of the Baum-Welch (also
known as the Forward-Backward) algorithm [3] to estimate the
parameters of the HMMs automatically from spoken speech
and corresponding text transcription. For each training
utterance, the training system takes speech and text, and builds
a network of phonemes using the dictionary. It first builds the
phonetic network for each word by using the phonetic
transcription provided by the dictionary. The phonetic network
is expanded into a triphone network so that each arc
completely defines a phonetic context up to the triphone.
These triphone networks of the words are then concatenated to
form a single network for the sentence, which in general can
take into account within-word as well as across-word
phonological effects. The training system then compiles a set
of phonetic context models for each triphone arc in the
network. It then runs the forward-backward algorithm to
estimate the parameters of the phonetic context models. The
Trainer operates in two modes: speaker-dependent and
speaker-adapted. Associated with these two modes are two
distinct methods for training the parameters of the hidden
Markov models described below.

Speaker-Dependent

This is the algorithm used to find the parameters of the
HMMs that maximize the probability of the observed data
given the model. This method produces HMMs that are finely
tuned to a particular speaker, and therefore in general would
work well only for this speaker. Typically about 15 minutes of
speech from a speaker is required for speaker-dependent
training.

Speaker-Adapted 

This is a new method of training that transforms HMM
models of one speaker to model the speech of a second speaker
[4]. This procedure estimates a probabilistic spectral mapping
from a well-trained prototype speaker to a new speaker. Using
this method it is possible for a new speaker to use the system
with as little as 15 seconds of speech.


5.  Word  Model  Generator 

Prior to recognition, word HMMs are computed for each
word in the vocabulary. The word model generator takes as
input two objects: a database of phonetic HMMs as obtained
in training, and a dictionary that contains phonetic spellings for
each word. For each phoneme in each word of the lexicon, it
first finds in the phonetic HMM database all the context
models that are relevant to this phoneme in its particular
phonetic environment. It then combines this set of phonetic
models with appropriate weights to produce a single HMM for
each phoneme in the word. This combination process saves
computation by precompiling the many levels of phonetic
context models that can occur for a given phonetic context into
a single representation. The output of the word model
generator is a database of word HMMs serving as the input to
the decoder.
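One way to picture the combination step is as a weighted average of discrete output distributions. The sketch below rests on that assumption: the text only says the models are combined "with appropriate weights", so the linear interpolation, the function name, and the example weights are illustrative rather than the BYBLOS procedure.

```python
def combine_models(context_models, weights):
    """Linearly interpolate discrete probability distributions.

    context_models: list of distributions over the same VQ codes,
                    e.g. triphone, left-context, and context-free models.
    weights:        one non-negative weight per model (hypothetical values);
                    they are renormalized to sum to one.
    """
    total = sum(weights)
    n_codes = len(context_models[0])
    combined = [0.0] * n_codes
    for model, weight in zip(context_models, weights):
        for code, prob in enumerate(model):
            combined[code] += (weight / total) * prob
    return combined
```

Because the weights are renormalized, the result is again a probability distribution whenever each input is one.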
6. Grammatical Knowledge Source

To solve the CSR problem requires major advances in
two areas: acoustic modeling and language modeling. A good
acoustic model is essential in making fine phonetic distinctions
when needed. However, it is not sufficient by itself to solve
the CSR problem. In a complex task with large vocabulary
where the number of hypothesized word candidates is large,
the probability of acoustic confusability can be high, and the
recognizer could make errors. A conceptually simple yet
effective way to restrict the number of words that are allowed
to be hypothesized, and therefore decrease the probability of
acoustic confusability, is to incorporate a grammar into the
recognizer. It is well known that recognition performance
improves as vocabulary size decreases. Similarly, when
syntactical information is used to reduce the number of words
that can legally follow a given sequence of words, a recognizer
is expected to make fewer errors. The purpose of using a
grammar, then, is to improve recognition performance, with an
added benefit of reduced computation.

Grammar  Design  and  Implementation 

We approach the implementation of a grammar in
BYBLOS in two stages. First, we create a description of the
task domain language using a modified context-free notation.
Typically this description is based on a representative set of
sentences that characterizes the task domain, and is designed to
capture generalizations of the linguistic phenomena found in
them. Second, we use a tool that transforms this description
into structures in our recognizer that provide the corresponding
grammatical constraints. This tool provides us with a general
facility for capturing in BYBLOS an approximation of any
language that can be described by a context-free grammar (CFG)
expressed as context-free rules. We elected to implement the
grammatical constraints in the form of a finite state automaton
(FA) similar to those described in [5].

At the first stage in generating a grammar, we use a
context-free notation augmented with variables in order to
simplify the process of describing a language. For example,
this notation would allow a rule that says a noun phrase of any
number can be replaced by an article and a noun of the same
number; ordinary context-free notation would require two rules
that are identical except that one would be for singular number
and the other for plural.

Our system first translates the augmented notation into
ordinary CFGs and then constructs a FA based on these rules.
Because context-free grammars can accept recursive languages
and a FA cannot, recursion is approximated in the FA by
limiting the number of levels of recursion. Such an
approximation is reasonable for most task languages, since
spoken sentences do not ordinarily use more than a few levels
of recursion.
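The effect of bounding recursion can be seen in a small sketch that enumerates the sentences of a toy CFG up to a fixed nesting depth. The resulting finite set of sentences is trivially representable by a finite automaton. This is only an illustration: the grammar, the function name, and the choice to bound total nesting depth (rather than the recursion count of each rule) are assumptions, and the actual BYBLOS tool builds the FA directly rather than listing sentences.

```python
def expand(symbol, grammar, max_depth, depth=0):
    """Return all word sequences derivable from `symbol` whose
    nonterminal nesting does not exceed `max_depth`."""
    if symbol not in grammar:
        return [[symbol]]            # terminal: a single word
    if depth > max_depth:
        return []                    # cut off deeper recursion
    results = []
    for rhs in grammar[symbol]:
        partials = [[]]
        for sym in rhs:
            subs = expand(sym, grammar, max_depth, depth + 1)
            partials = [p + s for p in partials for s in subs]
            if not partials:         # this rule cannot complete within the bound
                break
        results.extend(partials)
    return results


# Hypothetical toy grammar with a recursive noun phrase.
toy_grammar = {
    "NP": [["the", "N"], ["the", "N", "of", "NP"]],
    "N": [["ship"], ["fleet"]],
}
```

With this toy grammar, raising `max_depth` from 1 to 2 admits one level of "of NP" embedding; each further increase admits longer sentences, while the language stays finite for any fixed bound.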

7.  Recognition  Search  Strategy 

Once the FA is compiled from the context-free
description of the task domain, it is ready to be used in the
decoder. An important characteristic of a recognizer is the
search strategy that is used to find the word sequence that best
matches the input speech. We believe that an optimum search
strategy avoids making local decisions; the search decision
should be made globally, based on scores from all the KSs.
One such search paradigm is the one used in BYBLOS: the
search is made top down, linguistically driven, with tightly
coupled KSs.

The FA is convenient for deploying such a search
strategy. It is used as follows in our recognizer. We associate
with each transition in the FA a hidden Markov model for the
word. This model is used to compute the probability of the
acoustic event (sequence of VQ spectra) given the occurrence
of the word at that place in the grammar. Before the start of
recognition, the initial state of the FA where a legal sequence
of words can begin is initialized to unity, and all the other
states are initialized to zero. For each 10 ms frame of the input
speech, the scores for the states in all the words in the FA
network are updated using a modified Baum-Welch algorithm
[2]. In addition to state updates within a word, a word can
have a score propagated to its initial state from its best scoring
predecessor word. This simple state update operation is
repeated every 10 ms for each FA transition until the end of the
utterance is reached. The decoder output is then computed by
tracing back through the FA network to find the highest
scoring sequence of words that ends in the terminal state of the
FA.

One  potential  problem  associated  with  using  a  FA 
grammar  for  recognition  is  that  computation  is  expected  to  be 
proportional  to  the  number  of  transitions  in  the  FA.  This 
number  can  be  quite  large  for  complex  languages.  However,  in 
our  experience  with  different  grammars  in  our  recognizer,  we 
find  that  a  beam  search  effectively  reduces  the  computation  to 
a  very  manageable  level  while  maintaining  the  same 
performance  as  that  of  an  exhaustive  search. 
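A beam search of the kind mentioned above can be sketched as a pruning step applied after each frame. The text does not give the pruning criterion, so the sketch assumes a common one: keep only states whose log score lies within a fixed margin of the frame's best score. The function name and `beam_width` parameter are hypothetical.

```python
def beam_prune(scores, beam_width):
    """Keep only states whose log score is within `beam_width`
    of the best score in this frame.

    scores: dict mapping a state identifier to its log score.
    """
    if not scores:
        return {}
    best = max(scores.values())
    return {state: s for state, s in scores.items() if s >= best - beam_width}
```

Only the surviving states need to be updated in the next frame, which is where the saving in computation would come from.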

8. Byblos Recognition Performance

In [2], we presented word recognition results for a 334-word
electronic mail task. In speaker-dependent mode, we
demonstrated  performance  of  90%  across  several  speakers 
without  the  use  of  a  grammar  (i.e.,  branching  factor  of  334). 
Since  then,  we  have  tested  the  system  along  many  dimensions: 
two  task  domains,  FA  grammars  with  varying  perplexities, 
varying  amounts  of  adaptation  speech,  and  different  speaker 
types.  The  results  are  tabulated  in  Figure  2.  Below  we 
describe  the  different  conditions  in  more  detail. 

Task  Domains 

The two task domains tested are: Electronic Mail
(EMAIL) and Naval Database Retrieval (FCCBMP). Both
tasks have vocabulary sizes of approximately 350 words (334
for EMAIL, 354 for FCCBMP). A description of the task
domain language was created using CFG. The CFGs were
designed to capture generalizations of linguistic phenomena
found in example task domain sentences.

Grammars 

Two finite state grammars were generated for each task
domain: Command and Sentence. The Command Grammar in
each case was designed to cover only the command subset of
the language; the Sentence Grammar was designed to cover all
of the language, which included both command and question
type constructs. The maximum perplexity measures of the
grammars, as proposed in [6], are shown in Figure 2. In both
tasks, the sentence grammars have a higher perplexity than
their command counterparts.

Figure 2: BYBLOS Recognition Results. Two task domains
(EMAIL and FCCBMP), two grammars for each task (Command
and Sentence), and varying amounts of training speech
(2 minutes and 15 minutes). Also shown are maximum
perplexity measures for the grammars.

Adaptation  Time 

As described in Section 4, BYBLOS operates in two
modes, speaker-dependent and speaker-adapted. In
speaker-dependent mode, 15 minutes of training speech is required for
a speaker. This mode in general will give word accuracy in the
98.5% range. In the speaker-adapted mode, anywhere from 2
minutes down to 15 seconds of speech from a new speaker is
needed to "adapt" the HMM parameters to the new speaker.
The performance in this case is 97%.

Speaker  Type 

We have tested BYBLOS on several speakers with
different dialects, including a female speaker, a non-native
speaker, and 3 naive (uncoached) speakers. The recognition
results for these speakers showed little deviation from those of
typical male speakers of standard American dialects.
9. Summary

We have presented BYBLOS, a system for large
vocabulary continuous speech recognition. We showed how
we integrate multiple sources of knowledge to achieve high
recognition performance. In recognition experiments, we
demonstrated consistent performance across task domains,
grammars, adaptation time, and speaker type.

We are currently working to improve various aspects of
the system, including a real time implementation of the
recognizer, search strategy, acoustic modeling, and language
modeling. In the future, we plan to work on integration of
speech and natural language for speech understanding
applications.
Abstract

This  paper  presents  research  into  the  use  of  large-scale 
parallelism  for  a  continuous  speech  recognition  algorithm.  The 
algorithm, developed for the BBN Byblos system [1], uses
context-dependent Hidden-Markov models to achieve high
recognition accuracy. The multiprocessor used in the research,
the BBN Butterfly™ Parallel Processor, is a shared memory,
MIMD machine. The algorithm was implemented using the
Uniform  System  software  methodology,  a  system  that 
simplifies  parallel  programming  without  sacrificing  efficiency. 
The  algorithm  is  described,  highlighting  those  portions  critical 
to  an  efficient  parallel  implementation.  Some  of  the  problems 
encountered  in  trying  to  improve  efficiency  are  presented  as 
well  as  the  solutions  to  those  problems.  The  algorithm  is 
shown  to  achieve  79%  processor  utilization  on  a  97-node 
Butterfly  Parallel  Processor.  This  is  equivalent  to  a  speedup  by 
a  factor  of  77  over  a  single  processor  benchmark.1 

1.  Introduction 

The introduction of large-scale parallelism in computers
offers the potential for greatly increased speed and better
performance-cost ratios for algorithms that can make use of
this parallelism. This paper describes the parallel
implementation of a continuous-speech recognition algorithm
that successfully uses the speedup provided by a general
purpose multiprocessor, the Butterfly Parallel Processor.

The outline of this paper is as follows: Section 2
describes the Butterfly Parallel Processor and the Uniform
System. Section 3 describes the BBN word recognition
algorithm. Section 4 explains the initial parallel
implementation of the algorithm. Section 5 describes the
improvements to the algorithm for better processor utilization
and presents results based on these improvements. The final
section presents some conclusions from the work.

2.  Butterfly  and  Uniform  System 

The Butterfly Parallel Processor [2] is composed of
multiple (up to 256) identical nodes, each containing a
processor and memory, interconnected by a high-performance
switch. The Butterfly architecture is
multiple-instruction-multiple-data-stream (MIMD), in which each processor node
executes its own sequence of instructions, referencing data as
specified by the instructions. Each processor node contains
either a Motorola MC68000 or MC68020 microprocessor, an
optional floating-point co-processor, from 1 to 4 megabytes of
main memory, a co-processor called the Processor Node
Controller, memory management hardware, an I/O bus, and an
interface to the Butterfly switch.

1 This work was sponsored by the Defense Advanced Research Projects
Agency and was monitored by the Space and Naval Warfare Systems

binary-tree maximum and for other informative discussions

The Butterfly switch allows each processor to access the
memory  on  every  other  node.  Collectively,  these  memories 
form  the  shared  memory  of  the  machine,  a  single  address  space 
accessible  to  every  processor.  All  interprocessor 
communication  is  performed  using  shared  memory. 
Instructions  accessing  memory  on  the  same  node  as  a  processor 
typically take about 2 microseconds to complete, whereas those
accessing  memory  on  another  node  take  about  5  or  6 
microseconds.  Block  transfers  from  one  memory  to  another 
run  at  4  megabytes  per  second.  The  machines  used  in  this 
project  were  16-processor  and  97-processor  machines,  each 
with  1  megabyte  of  memory  and  a  MC68000  microprocessor 
on  the  processor  nodes.  Neither  had  hardware  support  for 
floating  point  arithmetic. 

The  software  for  the  project  was  written  using  the 
Uniform  System,  a  programming  methodology  supported  by  a 
library of high-level subroutines [3]. The benefit of using the
Uniform  System  is  that  it  can  provide  a  simple,  efficient 
solution  to  the  problem  of  load  balancing  for  the  memory  as 
well  as  for  the  processors.  To  balance  the  load  on  memory,  the 
Uniform System routines spread out the data evenly across the
different  physical  memories  in  the  machine.  Under  the 
assumption  that  distributed  data  will  also  distribute  memory 
accesses  fairly  evenly,  this  approach  can  reduce  the 
inefficiency  that  results  when  many  processors  attempt  to 
access  the  same  memory  simultaneously. 

To balance the load on processors, the Uniform System
treats processors as a pool of identical workers, all of which
can execute the same tasks. In this way, tasks can be
dynamically assigned to the free processors in the machine. In
a typical program, control starts out in a single processor of the
machine. To perform tasks in parallel, this processor calls a
Uniform System "generator" subroutine, specifying a set of
tasks to do and a task subroutine. The generator creates a
descriptor of the work to be done and starts all processors. The
processors then perform the work in parallel, each taking the
next task data from the descriptor and executing the task
routine with this data until all the work is completed. At that
point, control is returned to the original single processor. An
example of a simple generator is GenOnI. The call
"GenOnI(task_routine, Ntasks)" assigns processors to perform
the subroutine "task_routine" for every integer value in the
range 1 to Ntasks.
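The generator pattern described above can be mimicked with threads that pull the next index from a shared counter. The sketch below is an analogy in Python, not the Uniform System API: `gen_on_index`, `n_workers`, and the use of a lock-protected counter as the "descriptor" are all assumptions made for illustration.

```python
import threading


def gen_on_index(task_routine, n_tasks, n_workers=4):
    """Run task_routine(i) for every i in 1..n_tasks using a pool of
    identical workers that each take the next unclaimed index."""
    next_index = 1
    lock = threading.Lock()

    def worker():
        nonlocal next_index
        while True:
            with lock:                      # claim the next task index
                if next_index > n_tasks:
                    return                  # no work left
                index = next_index
                next_index += 1
            task_routine(index)             # do the work outside the lock

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                            # caller resumes only when all work is done
```

As in the description, the calling thread regains control only after every task has completed, so the call also acts as a synchronization point.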


3.  Recognition  Algorithm 

The  Byblos  system  has  two  major  components,  a  trainer 
and  a  recognizer.  The  recognition  component  was 
implemented  on  the  Butterfly  Parallel  Processor.  The  training 
component  uses  the  forward-backward  algorithm  [4]  to 
estimate  discrete-density  Hidden-Markov  models  of  context- 
dependent  phonemes.  It  combines  these  models  to  form  word 
models  that  are  used  in  recognition.  The  context-dependent 
models  lead  to  accurate  and  robust  recognition  performance; 
the system has achieved 90% correct recognition on a 335-word
speaker-dependent task with no grammar [5].

In  the  recognition  process,  input  speech  is  analyzed  every 
10  ms  and  then  vector  quantized  with  a  256-vector  codebook. 
The  analysis  and  quantization  are  done  in  real  time  on  an  FPS 
array  processor  attached  to  a  VAX.  The  quantization  codes, 
each representing a frame of input speech, are input in real
time over an Ethernet connection to the search algorithm on the
Butterfly Parallel Processor.

The search algorithm finds the best scoring sequence of
words using the trained word models. Each possible sequence
of words that is considered is called a word sequence theory.
The search uses the Viterbi decoding algorithm to update
scores for all word sequence theories at each frame. In order to
prevent  underflow  during  score  updating,  all  theory  scores  are 
normalized.  To  determine  the  "normalization  factor"  for  a 
frame,  the  algorithm  computes  the  maximum  score  of  all  states 
in  all  words  in  the  frame  and  sets  the  factor  to  the  score  ceiling 
minus  the  maximum  score. 

The major work being performed in the algorithm can be
abstracted in pseudo-code as follows:

FOR all input frames {
    max_score := 0
    best_end_score := 0
    FOR all words {

        update word score

        IF word_max_score > max_score
            max_score := word_max_score

        IF word_end_score > best_end_score
            best_end_score := word_end_score

    }

    determine initial state score for new theories from best_end_score
    determine normalization from max_score

}

determine and report best scoring theory

The algorithm computes two maxima: "max_score", the
maximum over all states of the words scored in an input frame,
and "best_end_score", the maximum score of all words' final
states. The first maximum is used for the normalization factor
mentioned above, and the second is used to determine the score
for the initial state of all words in the next frame.
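The two maxima and the normalization step can be sketched as follows. The sketch assumes additive (log-domain) scores, so that the normalization factor is added to every state score; the function names and the list-of-lists layout are hypothetical, and the word update itself is left out.

```python
def frame_maxima(word_state_scores):
    """Return (max_score, best_end_score) for one frame.

    word_state_scores: one list of state scores per word; the last
    entry of each list is taken to be the word's final state.
    """
    max_score = max(max(states) for states in word_state_scores)
    best_end_score = max(states[-1] for states in word_state_scores)
    return max_score, best_end_score


def normalize(word_state_scores, score_ceiling):
    """Shift all scores so the frame maximum sits at score_ceiling."""
    max_score, _ = frame_maxima(word_state_scores)
    factor = score_ceiling - max_score      # "ceiling minus the maximum score"
    return [[s + factor for s in states] for states in word_state_scores]
```

After normalization the largest state score in the frame equals the ceiling, which keeps scores from drifting toward underflow as frames accumulate.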

The  core  of  this  computation,  the  word  score  update, 
entails  updating  all  the  phonemes  in  a  word.  Each  phoneme 
update  requires  a  little  less  than  one  millisecond  of 
computation,  and  the  average  word  update  time  is  slightly 
more  than  4  milliseconds  for  the  vocabularies  used  in  this 
work. 


4.  Initial  Parallel  Implementation 

As the first step toward a parallel implementation, the
speech recognition program was ported from VAX/VMS to a
single processor of the Butterfly Parallel Processor. Both
versions of the program were in the language 'C'. The most
significant change to the program in this phase was the use of
the Uniform System memory management routines to store in
global shared memory about 1.5 megabytes of data that had
been stored on disk in the VAX version.

The VAX (and the first Butterfly System implementation)
used floating-point arithmetic, but because floating-point
arithmetic is performed in software in our Butterfly Parallel
Processor, it is substantially slower than fixed point. For this
reason, the program was switched to fixed-point arithmetic. As
part of this change, multiplication of probabilities in the
original version was converted to addition of corresponding log
probabilities. With this modification, the execution time was
about two minutes for a 3.5 second utterance from a 120-word
task. This is about the same speed as our optimized floating
point VAX 11/780 program.
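The conversion from multiplying probabilities to adding log probabilities in fixed point can be illustrated briefly. The scale factor and function names below are assumptions chosen for the example; the text does not state the fixed-point format that was used.

```python
import math

SCALE = 1024  # hypothetical fixed-point scale: 1/1024 of a natural-log unit


def to_fixed_log(probability):
    """Convert a probability to a scaled integer log probability."""
    return round(math.log(probability) * SCALE)


def path_score(probabilities):
    """Score a path by integer addition instead of floating-point
    multiplication of the individual probabilities."""
    return sum(to_fixed_log(p) for p in probabilities)
```

Dividing the integer result by `SCALE` recovers, up to rounding, the logarithm of the product of the probabilities, and the per-frame work reduces to integer additions.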

Examination of the pseudo-code in the preceding section
leads to a natural decomposition of the algorithm: the
fundamental parallel task is to update the score of a single
word for a single input frame. Using the Uniform System
generator GenOnI, the pseudo-code for the parallel version of
our algorithm becomes:

FOR all frames {
    best_end_score := 0
    max_score := 0

    GenOnI(update_word, Nwords)

    determine initial state score for new theories from best_end_score
    determine normalization from max_score

}

determine and report best scoring theory

In this version, the subroutine update_word now includes
the calculation of max_score and best_end_score. Note that
since the processor calling GenOnI waits for all processors to
finish  before  proceeding,  this  mechanism  provides  a 
synchronization  that  is  needed  to  ensure  that  no  processor 
begins  updating  words  of  a  new  frame  until  the  initial  state 
score  and  the  normalization  factor  for  the  next  frame  have  been 
computed. 

Using this simple approach to parallel implementation, a
first timing experiment was conducted using a 16-processor
machine and a 120-word vocabulary task. Processor utilization
was found to be 75%, i.e. the machine was effectively using
the computation corresponding to 12 of the 16 actual
processors [6]. This result was judged to be good enough to
proceed directly to work on a larger machine. The first time
the program was run on a 97-processor machine, processor
utilization was approximately 20%. Although this represents a
factor of 20 speedup of the program, it is an inefficient use of
the machine. The next section presents several factors that
contributed to the inefficiency as well as the methods used to
improve them.
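The utilization figures quoted in this paper relate to speedup by a simple product, assuming utilization is defined as speedup divided by the number of processors. The helper name below is hypothetical; the numbers are those given in the text.

```python
def effective_processors(utilization, n_processors):
    """Speedup over one processor implied by a given utilization."""
    return utilization * n_processors


# Figures quoted in the text:
initial_16 = effective_processors(0.75, 16)   # 12 of the 16 processors
initial_97 = effective_processors(0.20, 97)   # about 19.4, i.e. roughly a factor of 20
final_97 = effective_processors(0.79, 97)     # about 76.6, i.e. roughly a factor of 77
```

The same relation shows why 20% utilization on 97 nodes, although a twentyfold speedup, leaves most of the machine idle.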
5. Efficiency Improvements and Results

There  are  a  number  of  potential  obstacles  to  attaining 
efficient processor utilization on a multiprocessor. Typical
issues  include  contention  for  a  common  memory  location, 
serial  code  in  the  program,  and  processors  waiting  idly  to 
synchronize  with  other  processors.  Each  of  the  specific 
problems  described  below  includes  one  or  more  of  these  issues. 

Number of Tasks and Startup Overhead

Even  before  the  program  was  run  on  a  larger  machine,  we 
had anticipated that it would be hard to obtain high processor
utilization  with  a  vocabulary  as  small  as  120  words.  Since  our 
long-term  goal  is  to  recognize  speech  from  large  vocabularies, 
we  switched  to  a  larger  task  of  335  words.  This  change 
improved  processor  utilization  to  35%  on  the  97-processor 
machine. 

The  speed  of  processor  scheduling  was  examined  next.  In 
the  initial  parallel  version  shown  above,  the  generator 
subroutine  call  starts  all  the  processors  at  each  frame.  It  was 
found  that  the  overhead  of  starting  up  was  relatively  large  for 
the  amount  of  work  being  done  at  each  frame.  To  reduce  the 
overhead,  the  program  was  altered  to  start  all  processors  only 
once at utterance start, generating Nframes × Nwords tasks at
that  point  and  letting  each  processor  determine  its  word  and 
frame  indices  from  the  single  task  index  it  receives  from  the 
generator.  Processor  utilization  improved  to  about  50%  with 
this  change. 
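Recovering the frame and word from a single task index is a matter of integer division. The sketch below assumes zero-based indices and a frame-major ordering, so that all tasks of one frame are handed out before any task of the next; the text does not specify either convention, and the function name is hypothetical.

```python
def task_to_frame_and_word(task_index, n_words):
    """Map a single task index to (frame, word), both zero-based.

    Tasks are assumed to be numbered frame by frame, so indices
    0..n_words-1 belong to frame 0, the next n_words to frame 1, etc.
    """
    frame, word = divmod(task_index, n_words)
    return frame, word
```

With the 335-word vocabulary, task 335 would be the first word of the second frame under these conventions.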

Processor Synchronization Issues

The task generation change had removed the
synchronization provided by starting up a new generator at
each frame. To replace this, an explicit synchronization was
built into the program to be performed after all the words in a
frame were processed. There were two subsequent changes to
improve the efficiency of synchronization. The first dealt with
task ordering. In the early versions of the algorithm,
processors updated all the words in the vocabulary, with no
particular ordering of the words. Since words have varying
numbers of phonemes (from one to 14 phonemes in this task's
vocabulary), different words took different amounts of time to
update. If a processor began work on a long word near the end
of the work for a frame, other processors would finish their
assigned words and wait idly to synchronize with the one busy
processor. To reduce this inefficiency, the words were
processed in order from longest to shortest (in number of
phonemes).

In Figure 1, we schematically depict the situation before
and after the words are ordered. The filled rectangles represent
time when processors actively work on tasks, and the white
space represents time between tasks when no work is being
accomplished. In the right-hand part of the figure, idle
processor time is substantially reduced by sorting.

Figure 1: Ordering Tasks by Length

The second change to synchronization efficiency
concerned the point in the program at which synchronization
was done. As mentioned, the purpose of the synchronization
was to ensure that no processor proceeded to the next frame
until the starting score for words and the normalization factor
were computed. Since the normalization factor was only to
avoid score underflow, it could be estimated a frame or more
earlier. The only remaining synchronization constraint was the
word-starting score. This score, however, is used only at the
beginning of the first phoneme of each word. Considering this,
the order of the update of a word was reversed so that the last
phoneme was updated first, and the first phoneme updated last.
This change allowed a processor to finish work on one frame
and immediately begin work on updating a word from the next
frame, synchronizing only when it got to the first phoneme. In
this way, time that had previously been spent by processors
waiting for others to finish a frame was now being used to
perform useful work from the next frame.

Figure 2 depicts the situation for two frames of an
utterance before and after this change. Tasks for time T+1 are
shown in two shades. The darker portion represents the part of
the task that depends on the previous frame's work being
finished. On the right, with the order of the computation
reversed, the idle processor time is reduced. The effect of the
synchronization changes was to increase processor utilization
to approximately 72%.

Figure 2: Reversing Word Computation Order
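
The benefit of processing words from longest to shortest can be illustrated with a small simulation. This is only a sketch under stated assumptions: update time is taken to be proportional to the number of phonemes, each idle processor simply takes the next word in list order, and the function name and the word lengths are made up for illustration.

```python
import heapq


def frame_finish_time(word_lengths, n_processors):
    """Time at which the last processor finishes one frame.

    Models self-scheduling: whichever processor becomes idle first takes
    the next word in list order. Work per word is assumed proportional to
    its phoneme count (a simplification).
    """
    finish = [0] * n_processors          # per-processor finish times (min-heap)
    heapq.heapify(finish)
    for length in word_lengths:
        earliest = heapq.heappop(finish)  # first processor to become idle
        heapq.heappush(finish, earliest + length)
    return max(finish)


lengths = [1, 2, 2, 3, 3, 14]            # hypothetical phoneme counts
unsorted_time = frame_finish_time(lengths, 3)
sorted_time = frame_finish_time(sorted(lengths, reverse=True), 3)
# unsorted_time is 16; sorted_time is 14, the length of the longest word.
```

In the unsorted case the 14-phoneme word is started last, so two processors sit idle while the third finishes it. Starting it first lets the shorter words fill the remaining time.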
Finding Global Maximum

Finally, the efficiency of finding the maximum value was
also improved. A straightforward computation of the
maximum value requires that all values be compared with a
single memory location, but this approach results in contention
for that location. As a first improvement, the program was
altered to make each processor maintain its own local
maximum of the scores of all the words that it updates in a
frame. At the end of the frame, the global maximum of these
values over all processors was determined. In initial versions,
this was accomplished by having processors sequentially
compare their value to the global location and replace it if
necessary. Although on a sixteen-processor machine the time
for processors to turn in values in this way is negligible, with
97 processors the inefficiency of the approach becomes
noticeable.

An alternative to this approach was to set up a "binary
tree" of locations for taking the maximum. In this approach,
the processors' local maxima are the leaves of the tree and the
maxima are propagated up through the nodes of the tree. This
approach reduces the asymptotic time for finding a global
maximum from O(N) to O(log N), where N is the number of
processors. More importantly in our case, efficiency improved
because memory contention was reduced.

The total effect of all the improvements described above
was to improve processor utilization on a 97-processor
machine from 20% to 79%. Figure 3 is a graph of processor
utilization for 1 to 97 processors on the 335-word task. The
actual speed of the speech recognition improved accordingly.
After the optimizations are included, a one-processor Butterfly
Parallel Processor requires 128 times real time (128 seconds to
process one second of input speech) and a 97-processor
machine requires about 1.7 times real time.

Figure 3: Butterfly Processor Utilization, 335 words


6. Conclusions

This work has shown that the Butterfly architecture is
suitable for continuous speech word recognition. The
algorithm was implemented efficiently without changing the
type or amount of computation performed. Some ingenuity
was required to obtain an efficient realization, but once the
obstacles were understood, solutions presented themselves
fairly readily. The memory and processor management
functions of the Uniform System made initial parallelization of
the algorithm quite easy and provided several alternatives for
improving implementation efficiency when required.

We draw several broad conclusions about efficient
parallel programming as well. Most obviously, and perhaps
most importantly, it is crucial that sequentially executed code
be eliminated wherever possible. Similarly, much of the
inefficiency in our original multiprocessor program was due to
processors waiting for each other. Synchronizing processors
only after all other possible work is done was found to be a
good strategy to avoid this. Additionally, it can be very
important to minimize the overhead of parallel constructs such
as starting processors.
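
The binary-tree maximum described under Finding Global Maximum can be sketched sequentially as follows. This is an illustration of the reduction pattern only, not the Butterfly implementation: the function name is hypothetical, and on the real machine the comparisons within one round would run on different processors and touch different memory locations.

```python
def tree_max(local_maxima):
    """Combine per-processor maxima pairwise, level by level.

    Returns (global_maximum, rounds). Each round halves the number of
    candidates, so rounds grows as log2 of the number of processors.
    The input must be non-empty.
    """
    level = list(local_maxima)
    rounds = 0
    while len(level) > 1:
        # Each pair is compared at its own tree node, so no single
        # location is contended for by every processor.
        level = [max(level[i:i + 2]) for i in range(0, len(level), 2)]
        rounds += 1
    return level[0], rounds


# 97 local maxima are combined in 7 rounds rather than 97 sequential updates.
global_max, n_rounds = tree_max(range(97))
```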
ABSTRACT

A new model is proposed for the transformation in the cochlea
from Basilar membrane vibration to nerve fiber responses
in the VIIIth nerve. The model has been incorporated into
a system for speech processing that we are currently using as
a front end in a speech recognition system under development.
We have found that spectral representations based on this model
show certain advantages over traditional methods for spectral
analysis for particular applications.

We believe that this model represents the auditory periphery
much more accurately than previous models that we have
used. The most important change is that a new adaptation
model is used, one that was originally proposed by Goldhor [1].
With this adaptation model it is possible to obtain reasonable
matches to the equal-incremental-response criterion imposed by
the Smith and Zwislocki data [2]. Parameters of the system
were adjusted so as to match this criterion as well as possible.
In addition, the model was compared with auditory data in four
other experimental categories as well, as discussed in the paper.
These categories were selected because we believe they reflect
important aspects of the response for speech applications.
Examples are given of outputs of the system for speech signals in
order to illustrate how the nonlinearities in the model affect the
responses to speech.


INTRODUCTION 

The peripheral auditory system is typically modelled
by a bank of linear filters which resemble available data
on the shapes of auditory filters, followed by a nonlinear
stage that attempts to capture the dynamics of the transformation
from Basilar membrane vibration to nerve fiber
response. This part of the model incorporates such nonlinearities
as dynamic range compression and half-wave
rectification, and also captures effects such as short-term
adaptation, rapid adaptation, and forward masking. It is
very difficult to devise a scheme that will accurately reproduce
diverse aspects of auditory response, yet we feel that
it is very important in speech processing for these aspects
to be more-or-less correct. An enormous amount of data
is available from measurements made from auditory nerve
fibers for a number of different experimental paradigms. A
useful goal is to attempt to reproduce gross features that
emerge from such experiments. We selected auditory data
from five different categories of response measurements to
be compared with the model. These demonstrate the degree
of success in capturing the detailed wave shape in
steady state conditions, the dynamics of onset response,
the degree of forward masking, the extent of loss of synchrony
at high frequencies, and the incremental response
characteristics at onset and steady state.

*This research was supported by DARPA under Contract NC 9039-85*


MODEL  DESCRIPTION 

Figure 1 shows a block diagram of our current auditory-based
front-end system. The initial stage is a bank of
linear filters, which is followed by the "hair-cell/synapse"
stage that introduces the nonlinearities. A bifurcation
of the outputs leads to two spectral-like representations,
the "Envelope Spectrogram" and the "Synchrony Spectrogram."
The only part of this model that has been
changed since previous reports [3] is the hair-cell/synapse
stage. The new model for this stage consists of four subcomponents,
as shown in Figure 2: a half-wave rectifier, a
short-term adaptation circuit, a lowpass filter, and a rapid
Automatic Gain Control (AGC). We will discuss each of
these components in turn.

All of these components except the lowpass filter are
nonlinear, and therefore the final output is affected by the
ordering of the components. A particular ordering can be
justified in part by forming associations with elements of
the actual auditory system. Such links can also aid in the
design of each individual component. To the right of each
component in the figure is proposed a corresponding affiliation
with the auditory system. The hair-cell current
response as measured for amphibians shows a distinct directional
sensitivity [7]. It is not clear that the current is
a direct link in the response mechanism; nonetheless, it is
tempting to assume that half-wave rectification first occurs
in the hair cell, and hence this is the first component in
the model. There seems to be no evidence for short-term
adaptation in hair cell current or voltage responses; therefore
it is generally assumed that this effect is introduced
in the synapse between the hair cell and the neuron [5].
The logical ordering is therefore to place this component
second.

Figure 1: Block diagram of our computer model.

Figure 2: Block diagram of the subcomponents of Stage II, with
suggested auditory system affiliations indicated at right.

Figure 3: Plot of input-output response of the half-wave rectifier
used in the model. This mapping resembles the hair cell
current response as measured for frogs [4].

The AGC is assumed to be affiliated with the refractory
phenomenon of nerve fibers; therefore, this component
should be placed late in the series. Such an affiliation
implies that the rapid adaptation component of responses
to onsets is due to the refractory phenomenon, a theory
that has been proposed by Johnson and Swami [6]. It is
difficult to know where to place the lowpass filter. It is associated
with the gradual loss of synchrony in nerve fiber
responses as stimulus frequency is increased. The locus
(or loci) of such synchrony loss has not yet been determined.
The lowpass filter must follow the half-wave rectifier,
because it only makes sense after signal energy has
been preserved through a DC component. The solution
adopted was to try placing the lowpass filter in all three
of the remaining positions, and choose the one that yields
the best behavior in the final response.

The model for the half-wave rectifier, whose response
function is shown in Figure 3, is defined mathematically
as follows:

$$
y = \begin{cases} 1 + A \tan^{-1}(Bx), & x > 0 \\ e^{ABx}, & x \le 0 \end{cases}
$$

It is thus exponential for negative signals, linear but shifted
(by a "spontaneous" rate of 1) for small positive signals,
and compressive for larger signals, saturating at $1 + A\pi/2$.
It is based on the measured hair cell current responses as a
function of a fixed displacement of the cilia as determined
in frogs by Hudspeth and Corey [7].
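
A minimal sketch of this rectifier, assuming the negative branch has the form exp(ABx) so that value and slope match at x = 0. The default values of A and B below are placeholders chosen for illustration, not the settings used in the model.

```python
import math


def half_wave_rectifier(x, A=10.0, B=1.0):
    """Instantaneous half-wave nonlinearity.

    Positive inputs: 1 + A*atan(B*x), roughly linear for small x and
    saturating at 1 + A*pi/2. Non-positive inputs: exp(A*B*x), decaying
    toward zero. A and B here are illustrative values only.
    """
    if x > 0:
        return 1.0 + A * math.atan(B * x)
    return math.exp(A * B * x)


spontaneous = half_wave_rectifier(0.0)   # 1.0, the "spontaneous" rate
saturated = half_wave_rectifier(1e9)     # approaches 1 + A*pi/2
suppressed = half_wave_rectifier(-5.0)   # close to zero
```

Both branches have slope A*B at the origin, so the mapping is smooth there even though it is strongly asymmetric.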

The model for short-term adaptation is very similar
to one proposed by Goldhor [1]. It consists of a simple
nonlinear circuit, as shown in Figure 4. The input is the
voltage source, V_i, and the output is the current through
the conductance G1. G1 is in series with a diode, which
serves to lock out this branch of the circuit whenever the
voltage across G1 becomes negative (the "off" condition).
There is another conductor, G2, in parallel, in addition to
a capacitor. The capacitor accumulates a charge whenever
the signal V_i is sufficiently positive, and discharges through
G2 when the stimulus voltage falls below the capacitor
voltage.

Goldhor showed that such a circuit, when applied using
the envelope of the stimulus as the input V_i, obeys the
equal incremental response property of short-term adaptation
[2], and also appropriately exhibits a longer time
constant for recovery after signal offset than for adaptation
after signal onset. The latter property holds because
during recovery the diode is turned off, and the capacitor
discharges only through G2, whereas after increments the
diode tends to be on, and current can flow through both
conductors to charge the capacitor more quickly.

Figure 4: Goldhor's [1] adaptation circuit. In our model, the
output of the half-wave rectifier of Figure 3 is used as V_i, the
input to the adaptation circuit.


Our model uses the same circuit, except that the detailed
cycle-by-cycle behavior of the input signal is preserved
in V_i. The consequence is that the diode turns on
and off for each period of the stimulus, and an adapted response
is obtained only after the capacitor reaches a steady
state condition in which the amount of charge gained during
the time in which the input voltage level is higher is
exactly the same as the amount lost during the remaining
portion of the cycle. One consequence is that the effective
time constant for adaptation lies somewhere between
the "on" time constant, τ1, and the "off" time constant,
τ2. The time constant for recovery, on the other hand, is
equal to τ2.
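
The behavior of the adaptation circuit can be explored with a simple forward-Euler simulation. The sketch below rests on one reading of the verbal description and should be treated as an assumption: the diode and G1 connect the source to the capacitor, and G2 sits in parallel with the capacitor, giving an "on" time constant C/(G1+G2) and an "off" time constant C/G2. The component values and the function name are hypothetical.

```python
def adaptation_circuit(v_in, dt, g1=1.0, g2=0.25, c=0.01):
    """Return the current through G1 for each input sample.

    Assumed topology: source -> diode -> G1 -> capacitor node, with G2
    from the capacitor node to ground. The diode blocks current whenever
    the input falls below the capacitor voltage.
    """
    vc = 0.0                               # capacitor voltage
    out = []
    for v in v_in:
        i1 = g1 * max(v - vc, 0.0)         # diode branch conducts only forward
        out.append(i1)
        vc += dt * (i1 - g2 * vc) / c      # charge via G1, leak via G2
    return out


# Response to a sustained step: large at onset, adapting to a lower level.
step = [1.0] * 5000
response = adaptation_circuit(step, dt=1e-4)
onset, steady = response[0], response[-1]
```

For a constant input this topology gives an onset-to-steady-state ratio of (G1+G2)/G2 that does not depend on input level, which is the kind of level-independent ratio the Smith and Zwislocki criterion calls for.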


The current through the diode branch of the adaptation
circuit is next processed through a lowpass filter
that achieves two important effects: it reduces synchrony
to high-frequency stimuli and it smooths the square wave
shape encountered in the half-wave response for saturating
stimuli. The lowpass filter was realized as a cascade of
n_LP leaky integrators, each with an identical time constant
τ_LP. The two parameters, n_LP and τ_LP, were adjusted to
match available data on synchrony loss [8].
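
A cascade of identical leaky integrators can be sketched in discrete time as below. The one-pole recursion with coefficient exp(-dt/τ_LP) is one common discretization and is an assumption here, as are the function name and the parameter values in the usage line.

```python
import math


def leaky_cascade(x, n_lp, tau_lp, dt):
    """Pass samples x through n_lp identical first-order lowpass stages.

    Each stage is y[n] = a*y[n-1] + (1-a)*x[n] with a = exp(-dt/tau_lp),
    which has unity gain at DC and attenuates high frequencies.
    """
    a = math.exp(-dt / tau_lp)
    y = list(x)
    for _ in range(n_lp):
        state = 0.0
        stage_out = []
        for v in y:
            state = a * state + (1.0 - a) * v
            stage_out.append(state)
        y = stage_out
    return y


# A constant input passes through unchanged once the filter has settled.
settled = leaky_cascade([1.0] * 5000, n_lp=4, tau_lp=0.001, dt=1e-4)[-1]
```

Because the DC component passes at unity gain while the component at the stimulus frequency is attenuated, the ratio of the two falls as frequency rises, which is how such a filter reduces synchrony.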

The final component is the rapid AGC, which is defined
as follows:

(2)

where K_AGC is a constant and <> symbolizes "expected
value of," obtained by processing x[n] through a first-order
lowpass filter with time constant τ_AGC. This equation
resembles in form the formula obtained theoretically by
Johnson and Swami [6] as a steady-state solution for a
simple model of the refractory effect, where it is assumed
that a response is locked out for a time interval Δ after a
spike occurs:
Figure 5 shows the outputs of intermediate stages of
the 2000-Hz channel in response to a high-amplitude tone
at CF. The envelope of the response over a long time interval
is shown on the left, and the detailed waveshapes near
tone onset are shown on the right. Part a shows the response
after only the linear filter of Stage I. Part b shows
the response after the instantaneous half-wave rectifier.
The square wave shapes introduced here are lost after the
lowpass filter. The effects of the short-term adaptation
component are apparent in the envelope response on the
left in part c. The final AGC further alters the dynamics
of the onset, to produce a trend quite typical of auditory
nerve fibers, as shown in part d.

Figure 5: Responses at intermediate stages of the
hair-cell/synapse model for a tone at CF at high signal
level: (a) after critical band filter, (b) after half-wave rectification,
(c) after short-term adaptation and lowpass filter, and (d) after AGC.
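
The rapid AGC stage can be sketched as a divisive gain control. The form used below, y[n] = x[n] / (1 + K_AGC <x[n]>), is an assumption based on the verbal description of the constant and the running average, since the equation itself is not legible in this copy. The value 0.002 for K_AGC is the one reported in the concluding discussion; the value of τ_AGC and the function name are hypothetical.

```python
import math


def rapid_agc(x, k_agc, tau_agc, dt):
    """Divide each sample by 1 + k_agc * (running average of the input).

    The running average is a first-order lowpass (leaky integrator) with
    time constant tau_agc. The divisive form is assumed, not quoted.
    """
    a = math.exp(-dt / tau_agc)
    avg = 0.0
    y = []
    for v in x:
        avg = a * avg + (1.0 - a) * v      # estimate of <x[n]>
        y.append(v / (1.0 + k_agc * avg))
    return y


# A sustained input of 1000 units: the gain drops as the average builds up.
out = rapid_agc([1000.0] * 5000, k_agc=0.002, tau_agc=0.003, dt=1e-4)
onset_value, steady_value = out[0], out[-1]
```

Under this form the output is largest at onset, before the running average has built up, and settles toward x / (1 + K_AGC x) for a steady input, giving the rapid onset emphasis described in the text.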

COMPARISONS TO AUDITORY DATA

The above system has a number of parameters that
can be adjusted according to some criteria based on relevant
auditory data from the literature. At the same time,
the degree of success in the matching process can help
to evaluate the model's adequacy for capturing auditory
phenomena. The following data were selected as responses
that should be matched, in part based on a judgment of
which aspects of the auditory response are likely to be
significant with regard to speech analysis.

•  Envelope  response  characteristics  at  tone  onsets  as 
a  function  of  tone  level, 

•  Forward  masking  effects  as  a  function  of  masker  level 

•  Period histogram responses in steady state conditions
for one-formant vowel stimuli, as a function of
stimulus level,

•  Equal incremental response characteristic, and 5:2
onset-to-steady-state ratio, and

•  Synchrony  falloff  characteristics  as  a  function  of  tone 
frequency. 

The parameters of the system were adjusted to match
all of the above criteria as well as possible. Several iterations
through the matching process were necessary for
convergence. Some surprising results emerged from the exercise;
most remarkable was that τ2 for the Goldhor adaptation
circuit had to be set to a much larger value than
was anticipated in order to match the forward masking
data. Another discovery was that, although the short-term
adaptation component and the AGC component interact
in a complex way, it is possible to set their parameters so
that the equal-increment criterion imposed by the Smith
and Zwislocki experiment is reasonably well matched. We
will discuss each of the above criteria in turn, in each case
showing a plot of the auditory data and the corresponding
model response. The output of the half-wave rectifier
was multiplied by a gain term, G_HW, which was adjusted
to yield a final output that could be equated with a firing
rate. In all cases, the various time constants of the model
were set at fixed values, according to Table 1.


Tone Onsets:

Delgutte [9] plotted the envelopes of responses of cat's
ear nerve fibers to tone bursts as a function of eight different
tone levels, as shown here in Figure 6a. The experimental
paradigm was reproduced for the computer model,
and the resulting responses are shown in Figure 6b. Onset
response characteristics are largely dominated in the
model by the parameters of the rapid AGC component.

Forward Masking:

Delgutte's [9] plots for a forward masking experiment
are shown in Figure 7a, along with the results of the computer
model in Figure 7b. The plots are given as a function
of adapter level, with the test tone level held fixed. The
main controlling factor of forward masking in the model
is τ2 of the short-term adaptation circuit.


Period Histograms:

Delgutte's [9] plots of the period histograms of responses
to a one-formant vowel stimulus are shown in Figure 8a,
along with the model results in Figure 8b. In both the auditory
data and the model data, the formant bandwidth
in the response appears to become larger (more rapid decay
with each period) at intermediate amplitudes, and
much smaller at large amplitudes, when saturation effects
are dominating the response. The half-wave rectifier is
the controlling factor in this steady-state phase-locked response
characteristic, although the short-term adaptation
circuit also plays a role.


Figure 6: (left) Response patterns of an auditory nerve fiber
to a tone burst as a function of signal level (from Delgutte
[9]). The 180-ms burst has a rise/fall time of .25 ms, and a
frequency, 770 Hz, approximately equal to the fiber CF. The
post-stimulus-time (PST) histogram was computed with a bin
width of 1.4 ms and then smoothed with a three-point smoother.

Figure 7: (left) Response patterns of an auditory nerve fiber
to a 20-ms test tone preceded by a 200-ms adapting tone (from
Delgutte [9]). Both tones have a rise time of 2.5 ms, and a
frequency, 1220 Hz, approximately equal to the fiber CF. Histograms
are computed with a 1-ms bin width, and three-point
smoothed. (right) Response patterns for the computer model for
the same stimulus conditions, using a 3-ms Hamming window
for smoothing.


Figure 8: (left) Response patterns of an auditory nerve fiber
to a single-formant synthetic stimulus as a function of signal
level (from Delgutte [9]). The stimulus has an 800-Hz formant
frequency, approximately equal to the fiber CF. Formant bandwidth
is 70 Hz, and the fundamental frequency of voicing is 100
Hz. The 10-ms period histogram, computed with a 50-µs bin
width, is repeated twice in each case, to show two pitch periods
of the response. (right) Response patterns for the model for
the same stimulus conditions. The responses in this case are
unsmoothed.
Incremental Responses:

Smith and Zwislocki [2], using tone pedestals as stimuli,
measured rate responses of guinea pig auditory nerve fibers
as a function of time. The stimuli consisted of sudden-onset
tone bursts whose amplitudes, I, were incremented
by an amount δI at a time τ = 150 ms after initial onset.
A PST histogram of the response was computed, and a
difference between the response just before and just after
the amplitude increment constituted a "steady state incremental
response." This incremental response, defined
by IR = R+ - R-, was then compared with an "onset incremental
response," defined as the difference between the
response to an onset tone at level I + δI and one at level
I. Two important observations were: 1) the steady-state
and onset IR's were nearly equal for stimuli of intermediate
range, but the steady-state IR was somewhat larger for
stronger stimuli, and 2) the ratio of the response R_0 at onset
to the response R- at steady state was approximately
equal to 2.5, regardless of the onset intensity level.

This was the most difficult experimental paradigm to
match with the model. The rapid AGC and the short-term
adaptation circuit tend to impose opposing constraints on
the outputs. It was possible to obtain a fairly constant
ratio of onset to steady-state response, but this ratio was
consistently too large (3.0 instead of 2.5), as shown in Figure
9a. For the parameter settings shown in Table 1, the
3-dB onset incremental response of the model was slightly
larger than the 3-dB steady-state incremental response for
weak signals, but became significantly smaller for stronger
signals. This result is in close agreement with the data, as
shown in Figure 9b.

Figure 9: (a) Plots of onset firing rates versus steady-state firing
rates, in response to tone pedestals at CF, for two auditory
neurons (left, from Smith and Zwislocki [2]), and for the computer
model (right). Model response is for the 2000-Hz channel.
(b) (left) Plots of median normalized 3-dB incremental responses
for 10 auditory neurons (from Smith and Zwislocki [2]) at onsets
(open circles) and at steady-state conditions. (right) Plots
of normalized 3-dB incremental responses for model, at onsets
(open circles) and at steady-state conditions.


Synchrony Falloff:

Johnson [8] gave a specific definition for a "synchronization
index" that he applied to the period histograms
of the steady-state responses of nerve fibers to tone stimuli.
This index was defined as

$$
S_f = \frac{A(F_0)}{A(0)} \qquad (4)
$$

where $S_f$ is the synchronization index, $A(f)$ is the amplitude
of the spectrum of the period histogram at frequency
$f$, and $F_0$ is the tone frequency. Johnson measured $S_f$
for a large number of fibers, for tones not necessarily at
CF, and obtained the plot shown in Figure 10. Superimposed
as large triangles on the plot are points obtained by
applying the same definition for synchrony to the model
outputs. The main factor controlling the synchrony falloff
in the model is the lowpass filter.
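
As a sketch of how the synchronization index of equation 4 might be computed from a period histogram, assume the histogram spans exactly one period of the tone, so that the component at the tone frequency is the first harmonic of the histogram. The function name and the test histogram are illustrative.

```python
import cmath
import math


def synchronization_index(histogram):
    """Ratio of the fundamental to the DC component of a period histogram.

    Assumes the bins cover exactly one period of the stimulus tone and
    that the histogram is not all zeros. Returns a value between 0
    (no phase locking) and 1 (all activity in a single bin).
    """
    n = len(histogram)
    dc = sum(histogram)
    fundamental = sum(h * cmath.exp(-2j * math.pi * k / n)
                      for k, h in enumerate(histogram))
    return abs(fundamental) / dc


# A half-wave rectified sinusoid has a synchronization index near pi/4.
bins = 1000
rectified = [max(math.sin(2 * math.pi * k / bins), 0.0) for k in range(bins)]
index = synchronization_index(rectified)
```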


Figure 10: Scatter diagram of synchronization index (from
Johnson [8]) as defined in equation 4 as a function of tone frequency
(339 measurements from 233 units), with model results
superimposed as large triangles.


OUTPUTS  OF  THE  MODEL  FOR 
SPEECH  SIGNALS 

Figure 11 shows an example of the Stage II outputs for
a short segment of a male speaker's voiced speech, during
the /e/ of the word "make." Part a gives the wideband
spectrogram of the word, with a vertical bar indicating the
time at which channel outputs are shown in part b. The
50 ms time window includes about five pitch periods. The
peaks are skewed slightly to the left for low frequencies, a
feature that has been observed in auditory data as well [8].
Part c of the figure shows the output of the channel whose
CF is at F2 of the vowel. A prominent component at the
formant frequency is evident. Such formant periodicity is
utilized by the synchrony algorithm in Stage III.

Figure 12 compares Stage I outputs with Stage II outputs
for the word "description" spoken by a female speaker.
Each waveform is the smoothed output of one of the 40
channels as a function of time, with low-frequency channels
at the bottom. It is essential to represent Stage I outputs
by a log magnitude rather than a magnitude; otherwise
the vowel peaks are overwhelmingly larger than the rest
of the data. Log magnitude also corresponds to traditional
analysis methods. Because of the saturating nonlinearity
in the half-wave rectifier as well as in the final AGC, a log
representation is not appropriate for Stage II outputs. The
phonetic transcription has been superimposed, to help in
judging where segment boundaries should be detected.

All segment boundaries, with the exception of the /n/,
are well delineated in the Stage II representation. The closure
intervals for both the /k/ and the /p/ are flat valleys
in the Stage II representation; there is clear evidence for
forward masking here, particularly in the low-frequency
region for the /p/. The vowel /I/ has masked low-frequency
noise not only during the /p/ closure interval but
also during the subsequent /ʃ/. Such masking phenomena
should enhance the contrast between vowels and fricatives.
The boundary between the /ə/ and the final /n/ is very
difficult to see in the Stage I representation, but there is a
much greater hope of detecting it after the Stage II nonlinearities.
The stop burst onsets for the /d/ and the /k/
are also much sharper after Stage II.

SUMMARY  AND  CONCLUSIONS 

This paper describes the nonlinear component of a relatively
simple model for auditory processing of speech signals,
which attains a reasonably good match to measured
auditory responses for a number of different experimental
paradigms. The model offers the hope of elucidating
further the nature of auditory response to speech. In addition,
we anticipate that representations obtained from
such a model will be well-suited to applications in computer
speech recognition.
Figure 11: (a) Wideband spectrogram of the word "make,"
spoken by a male speaker. (b) Stage II outputs of 40 channels,
with the lowest frequency channel at the top, for five pitch periods
during the vowel /e/ at the time of the vertical bar in
part a. (c) Output of the single channel at the frequency of the
second formant at the same time as in part b.


Figure 12: (a) Log magnitude response of Stage I outputs
for the word "description" spoken by a female speaker, with the
lowest frequency channel at the bottom. (b) Magnitude response
of Stage II outputs for the same word. The phonetic transcription
is superimposed on the plots, and the original waveform is
shown below in each case.

The Smith and Zwislocki data showing a constant ratio
of onset to steady-state response, and a close-to-equal
incremental response characteristic for onset and steady-state
conditions, have led to the hypothesis that the adaptation
process might be linear in nature. A possible alternative
explanation, based on the results from the model
described here, is that this apparently linear feature may
be attributed to a cascade of an enhancing nonlinearity
with a compressive nonlinearity, such that the two effectively
cancel one another under certain conditions.

The model used for the AGC is a poor approximation of the
refractory effect as it is currently understood. First,
equation 3 is only valid for steady-state conditions, and only
exact for signals that are periodic with A. Second, a leaky
integrator yields an averaging window for <x> that is
exponential in shape, whereas a rectangular window is a much
better approximation to the recovery function. Nonetheless, the
value for Kagc that was determined experimentally to best match
auditory data is 0.002. This value corresponds to a 2-ms
lockout period, which is a little long but at least the correct
order of magnitude. Perhaps a more realistic model for the
refractory effect that would be appropriate during onsets as
well as steady states would result in a better match to the
dynamics of the onset envelope response.

It is still premature to suggest that an auditory-based
speech analysis system will pay off in speech recognition.
There are emerging, however, strong indications that
auditory-based representations are interesting and worthy of
further study. Onset and offset enhancement properties are
particularly effective in sharpening segment boundaries, as
discussed in [10]. The forward masking phenomenon should be
effective in reducing noise in stop bursts and enhancing
low/high frequency contrast in strong fricatives. We are now
becoming more confident in the validity of the computer models,
such that they may reveal interesting effects in auditory
speech processing, which may lead the way to appropriate
later-stage speech recognition strategies.

ABSTRACT

A new approach to vowel recognition is described, which
begins by reducing a spectrographic representation to a set of
straight-line segments that collectively sketch out the formant
trajectories. These “line-formants” are used for recognition by
scoring their match to a set of histograms of line-formant
frequency distributions determined from training data for the 16
vowel categories in the recognition set. Speaker normalization
is done by subtracting F0 from line-formant frequencies on a
Bark scale. Although the formants are never enumerated or
tracked explicitly, the frequency distributions of the formants
are the main features influencing the recognition score.
Recognition results are given for 2135 vowels extracted from
continuous speech spoken by 292 male and female speakers.


INTRODUCTION 

The formant frequencies are probably the most important
information leading to the recognition of vowels, as well as
other sonorant and even possibly obstruent sounds. Therefore,
researchers have spent a considerable amount of effort
designing robust formant trackers, which attempt to associate
peaks in the spectrum with formant frequencies, using
continuity constraints to aid in the tracking of the formants.
Once the formant tracks are available, it then becomes possible
to identify directions and degree of formant movements,
features that are important in recognizing diphthongs,
semivowels, and place of articulation of adjacent consonants.

It is impossible to design a “perfect” formant tracker.
The most serious problem with formants is that when they are
wrong there are often gross errors. Therefore, we have decided
to adopt a somewhat different approach, one that can lead to
information about formant movements without explicitly
labelling the formant numbers. The method also collapses the
two stages of formant tracking and track interpretation (e.g.,
“rising formant”) into a single step. The outcome is that a
spectrographic representation is reduced to a skeleton sketch
consisting of a set of straight-line segments, which we call
“line-formants,” that collectively trace out the formant
tracks. The recognition strategy then involves matching all of
the line-formants of an unknown segment to a set of templates,
each of which describes statistically the appropriate
line-formant configurations for a given phonetic class (which
could be as detailed as “nasalized /*/” or as general as “front
vowel”). Usually the number of line-formants for a given speech
segment is considerably larger than the number of formants,
because in many cases several straight-line segments are
required to adequately reflect the transitions of a single
formant.

*This research was supported by DARPA under Contract
N00039-85-C-0254, monitored through Naval Electronic Systems
Command.


SIGNAL  PROCESSING 

Spectral  Representation 

The system makes use of two spectrogram-like
representations that are based on our current understanding of
the human auditory system [1]. The analysis system consists of
a set of 40 critical band filters, spanning the frequency range
from 160 to 6400 Hz. The filter outputs are processed through a
nonlinearity stage that introduces such effects as onset
enhancement, saturation and forward masking. This stage is
described in detail in a companion paper [2]. The outputs of
this stage are processed through two independent analyses, each
of which produces a spectrogram-like output. The “Mean Rate
Spectrogram” is related to mean rate response in the auditory
system, and is used for locating sonorant regions in the speech
signal. The “Synchrony Spectrogram” takes advantage of the
phase-locking property of auditory nerve fibers. It produces
spectra that tend to be amplitude-normalized, with prominent
peaks at the formant frequencies. The amplitude of each
spectral peak is related to the amount of energy at that
frequency relative to the energy in the spectral vicinity. The
line-formant representation is derived from this Synchrony
Spectrogram.

Line-formant Processing

The line-formants are obtained by first locating sonorant
regions, based on the amount of low frequency energy in the
Mean Rate Spectrogram. Within these sonorant regions, a subset
of robust peaks in the Synchrony Spectrogram is selected. Peaks
are rejected if their amplitude is not sufficiently greater
than the average amplitude in the surrounding time-frequency
field. For each selected peak, a short fixed-length line
segment is determined, whose direction gives the best
orientation for a proposed formant track passing through that
peak, using a procedure as outlined in Figure 1. The amplitude
at each point on a rectangular grid within a circular region
surrounding the peak in question is used to update a histogram
of amplitude as a function of the angle, θ. Typical sizes for
the circle radius are 20 ms in time and 1.2 Bark in frequency.
The maximum value in the histogram defines the amplitude and
corresponding θ for the proposed track, as marked by an arrow
in Figure 1c.
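
The orientation search described above can be sketched in a few
lines of Python. This is a minimal illustration, not the
authors' implementation: the spectrogram layout (`spec[t][f]`),
the number of angle bins, the scaling of time and frequency
offsets by the two radii, and the function names are all
assumptions introduced here. Angles are folded into [0, π)
because a line through the peak has no direction.

```python
import math

def orientation_histogram(spec, t0, f0, r_t, r_f, n_bins=12):
    """Accumulate amplitude by angle around the peak at (t0, f0).

    spec is a list of frames, each a list of channel amplitudes.
    r_t and r_f are the region radii in frames and channels
    (hypothetical units); offsets are divided by them so that the
    region is a unit circle in scaled coordinates.
    """
    hist = [0.0] * n_bins
    for dt in range(-r_t, r_t + 1):
        for df in range(-r_f, r_f + 1):
            if dt == 0 and df == 0:
                continue  # the peak itself has no angle
            t, f = t0 + dt, f0 + df
            if not (0 <= t < len(spec) and 0 <= f < len(spec[t])):
                continue  # grid point falls outside the spectrogram
            x, y = dt / r_t, df / r_f
            if x * x + y * y > 1.0:
                continue  # outside the circular region
            # Fold the angle into [0, pi): opposite directions share a bin.
            theta = math.atan2(y, x) % math.pi
            # Bins are centered on multiples of pi / n_bins.
            b = int(round(theta / math.pi * n_bins)) % n_bins
            hist[b] += spec[t][f]
    return hist

def best_orientation(hist):
    """Return (angle in radians, accumulated amplitude) of the largest bin."""
    n_bins = len(hist)
    b = max(range(n_bins), key=hist.__getitem__)
    return b * math.pi / n_bins, hist[b]
```

Note that the angle here is measured in the scaled
time-frequency plane, so a value of π/4 corresponds to a rise of
one frequency radius per time radius rather than to any fixed
slope in Bark per millisecond.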

At each time frame several new short segments are
generated, one for each robust spectral peak. A short segment
is then merged with a pre-existing partial line-formant
whenever the two lines have a similar orientation, and the
distance between each endpoint and the other line is
sufficiently small. The merging process is accomplished by
creating a weighted-average line-formant that incorporates the
new line. If a given new segment is sufficiently unique, it is
entered as a new partial line-formant.


The resulting Skeleton Spectrogram for the /a/ in the word
“shock” is illustrated in Figure 2a, along with a Schematized
Spectrogram in Figure 2b, included to facilitate visual
evaluation. The latter is constructed by replacing each
line-formant with a time sequence of Gaussian-shaped spectral
peaks with amplitude equal to the line's amplitude. The
corresponding Synchrony Spectrogram is shown in Figure 2c, with
line-formants superimposed. For direct comparison, Figure 2d
shows a Synchrony Spectral cross section at the time of the
vertical bar, on which is superimposed a cross section of the
Schematized Spectrogram. For this example, we see that peak
locations and amplitudes in the vowel are accurately reflected.
In addition, formant transitions appropriate for the palatal
fricative on the left and the velar stop on the right are also
captured.

Figure 1: Schematic illustration of process used to determine
an orientation for a formant passing through a peak. (a)
Synchrony Spectrogram with cross-bars indicating a referenced
peak. (b) Schematic blow-up of region around the peak,
outlining procedure to generate a histogram of amplitude as a
function of angle. (c) Resulting histogram for the example in
part a.

Figure 2: Sample line-formant outputs: (a) Skeleton
Spectrogram for word “shock.” (b) Corresponding Schematized
Spectrogram. (c) Synchrony Spectrogram with line-formants
superimposed. (d) Cross-sections from b and c at the cursor,
superimposed.


RECOGNITION EXPERIMENT

Thus far, we have focused our studies on
speaker-independent recognition for 16 vowels and diphthongs of
American English in continuous speech, restricted to obstruent
and nasal context. The semivowel context is excluded because we
believe that in many cases vowel-semivowel sequences should be
treated as a single phonetic unit, much like a diphthong.

Speaker Normalization

Our first task was to devise an effective
speaker-normalization procedure. Many investigators have noted
the strong correlation between formant frequencies and F0 [3].
The relationship is clearly nonlinear: the second formant for
female /i/ is higher on average by several hundred Hz, whereas
the F0 difference is on the order of 100 Hz. However, on a Bark
(critical band) scale the male-female difference in F2 for /i/
becomes much more similar to that in F0. Thus we decided to try
a very simple scheme: for each line-formant, subtract from the
line's center frequency the median F0 over the duration of the
line, on a Bark scale.

We found this normalization procedure to be remarkably
effective, as illustrated in Figure 3. Part a shows a histogram
of the center frequencies of all of the lines for 35 male and
35 female /æ/ tokens. Part b shows the same data, after median
F0 has been subtracted from each line's center frequency. The
higher formants emerge as separate entities after the F0
normalization. The normalization is not as effective for F1,
but the dispersal in F1 is due in part to other factors such as
vowel nasalization.

A valid question to ask is the following: if it is supposed
that speaker normalization can be accomplished by subtracting
a factor times F0 from all formant frequencies, then what
should be the numerical value of the factor? An answer can be
obtained experimentally using autoregressive analysis. We
defined F'n = Fn - αF0 to be the normalized formant frequency
for each line. Using vowels for which the formants are well
separated, we associated a group of lines with a particular
formant such as F2. The goal was to minimize total squared
error for each remapped formant among all speakers, with
respect to α. The resulting estimated value for α was 0.975,
providing experimental evidence for the validity of the
proposed scheme.
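
One way to read the estimation step is as an ordinary
least-squares fit: for the lines assigned to one formant, find
the α that makes the normalized values Fn - αF0 cluster as
tightly as possible around their own mean. The sketch below
makes that assumption explicit; the closed-form solution
(covariance over variance), the function name, and the synthetic
numbers are illustrative and are not taken from the paper. All
frequencies are assumed to be on a Bark scale already.

```python
def estimate_alpha(formant_bark, f0_bark):
    """Least-squares alpha minimizing the spread of Fn - alpha * F0.

    Minimizing sum((Fn_i - alpha * F0_i - c)^2) over alpha and the
    free offset c gives alpha = cov(Fn, F0) / var(F0).
    """
    n = len(formant_bark)
    mean_f = sum(formant_bark) / n
    mean_p = sum(f0_bark) / n
    cov = sum((f - mean_f) * (p - mean_p)
              for f, p in zip(formant_bark, f0_bark))
    var = sum((p - mean_p) ** 2 for p in f0_bark)
    return cov / var

# Synthetic check: formant values built with a known factor.
f0 = [1.0, 1.5, 2.0, 2.5]
formant = [12.0 + 0.975 * p for p in f0]
alpha = estimate_alpha(formant, f0)
```

A value of α close to 1 indicates that subtracting F0 directly,
with no scaling, is already near the least-squares optimum,
which is the conclusion the text draws from its estimate of
0.975.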


Figure 3: Histograms for center frequencies of all
line-formants for 35 female and 35 male tokens of /æ/, (a)
without F0 normalization, and (b) with F0 normalization.


Scoring Procedure

Our goal in developing a recognizer for the vowels was to
emphasize the formant frequency information without ever
explicitly identifying the formant numbers. We wanted to avoid
traditional spectral template-matching schemes, because they
depend too heavily on irrelevant factors such as the loudness
or the overall spectral tilt. On the other hand, we did not
want to specify, for example, the distance between F2 and a
target F2, because this relied on accurately enumerating the
formants.

We decided to construct histograms of frequency
distributions of spectral peaks across time, based on data
derived from the line-formants. The scoring amounts to treating
each histogram as a probability distribution, and matching the
unknown token's line-formants against the appropriate
distributions for each vowel. To construct the histograms for a
given vowel, all of the line-formants in a training set were
used to generate five histograms intended to capture the
distributions of the formants at significant time points in the
vowel. All lines were normalized with respect to F0, which was
computed automatically using a version of the Gold-Rabiner
pitch detector [4]. Each line-formant's contributions to the
histograms were weighted by its amplitude and its length.

Only left, center and right frequencies of the lines were
used in the histograms. The left frequency of a given
line-formant falls into one of two bins, depending upon whether
or not it is near the beginning of the vowel. Right frequencies
are sorted similarly, with a dividing point near the end of the
vowel. Center frequencies are collected into the same histogram
regardless of their time location. Such a sorting process
results in a set of histograms that reflects general formant
motions over time. For example, the F2 peak in the histograms
for /e/ shifts upward from left-on-left to center to
right-on-right, reflecting the fact that /e/ is diphthongized
towards a /y/ off-glide, as illustrated in Figure 4.

To score an unknown token, the left, center, and right
frequencies of all of its lines are matched against the
appropriate histograms for each vowel category, which are
treated as probability distributions. The score for the token's
match is the weighted sum of the log probabilities for the five
categories for all of the line-formants. The amplitude of the
line does not enter into the match, but is used only as a
weight for the line's contribution to the score. This strategy
eliminates the problem of mismatch due to factors such as
spectral tilt or overall energy.
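
The scoring rule can be illustrated with a short Python sketch.
Everything beyond the rule itself is an assumption made for the
example: the bin width, the probability floor that avoids taking
the log of zero, the dictionary layout of a line-formant, and
the category names. The paper's five categories would each
appear as one key in `vowel_model`; the usage example shows only
one for brevity.

```python
import math

def to_distribution(counts, floor=1e-6):
    """Turn histogram counts into probabilities, flooring empty bins."""
    total = sum(counts)
    probs = [max(c / total, floor) for c in counts]
    norm = sum(probs)
    return [p / norm for p in probs]

def score_token(lines, vowel_model, bin_width=0.5):
    """Weighted sum of log probabilities over all line-formants.

    lines: list of dicts with an amplitude 'amp' and a mapping
           'freqs' from category name to F0-normalized frequency
           (Bark).
    vowel_model: mapping from category name to a probability list.
    """
    total = 0.0
    for line in lines:
        for category, freq in line["freqs"].items():
            probs = vowel_model[category]
            # Clamp the bin index to the histogram range.
            b = min(max(int(freq / bin_width), 0), len(probs) - 1)
            # Amplitude acts only as a weight, never as a matched feature.
            total += line["amp"] * math.log(probs[b])
    return total

# Two hypothetical vowel models that differ in where the peak lies.
vowel_a = {"center": to_distribution([1, 1, 10, 1, 1, 1])}
vowel_b = {"center": to_distribution([1, 1, 1, 1, 1, 10])}
token = [{"amp": 2.0, "freqs": {"center": 1.2}}]
score_a = score_token(token, vowel_a)
score_b = score_token(token, vowel_b)
```

Because amplitude only scales each term, doubling every line's
amplitude doubles all vowel scores alike and leaves their
ranking unchanged, which is the insensitivity to overall energy
that the text describes.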

Recognition  Results 

The vowels used for recognition were extracted from
sentences in the TIMIT database [5]. The speakers represented a
wide range of dialectal variations. A total of 2135 vowel
tokens spoken by 206 male and 82 female speakers were used as
both training and test data, using a jackknifing procedure. The
distributions of vowels are shown in Table 1. Each speaker's
vowel tokens were scored against histograms computed from all
of the line-formants except those from that speaker. The
scoring procedure was as discussed above, with histograms
defined for sixteen vowel categories. The endpoints for the
vowels were taken from the time-aligned phonetic transcription.

Table 1: Distributions of vowels in recognition experiment
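
The jackknifing procedure amounts to leave-one-speaker-out
evaluation. A minimal sketch, assuming tokens are grouped in a
dictionary keyed by a speaker identifier (the data layout and
the function name are hypothetical):

```python
def jackknife_splits(tokens_by_speaker):
    """Yield (held_out_speaker, training_tokens, test_tokens).

    For each speaker in turn, the training set is every token from
    all other speakers and the test set is that speaker's tokens.
    """
    for held_out in tokens_by_speaker:
        train = [tok
                 for spk, toks in tokens_by_speaker.items()
                 if spk != held_out
                 for tok in toks]
        yield held_out, train, tokens_by_speaker[held_out]

data = {"s1": ["a", "b"], "s2": ["c"], "s3": ["d", "e"]}
splits = list(jackknife_splits(data))
```

In the setting described here, the training tokens of each split
would be used to build the histograms and the held-out speaker's
tokens would be scored against them.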

A  matrix  of  first-choice  confusion  probabilities  is  given 
in  Table  2,  in  terms  of  percent  correct  in  the  phonetic 
category.  For  the  most  part,  confusions  are  reasonable. 
We  feel  encouraged  by  this  performance,  especially  con¬ 
sidering  that  multiple  dialects  and  multiple  contexts  are 
included  in  the  same  histogram. 

Figure 5 summarizes recognition performance in terms of
percentage of time the correct answer is in the top N, for all
speakers, and for male and female speakers separately.
Recognition was somewhat worse for females, who represented
only 25% of the population. Also shown are the recognition
results for female speakers when the F0 normalization scheme is
omitted, both in collecting the histograms and in scoring.
Significant gains were realized as a consequence of the
normalization. The performance for the male speakers without F0
normalization, however (not shown), did not change.

Figure 4: Histograms for (a) left-on-left, (b) center, and (c)
right-on-right line-formant frequencies for 12a tokens of /e/,
F0 normalized.

Figure 5: Recognition results expressed as percent of time
correct choice is in top N, for the following conditions: (a)
all speakers, (b) males only, (c) females only, and (d) females
without F0 normalization.


Table 2: First choice confusion matrix for the vowels.
Row = Labeled Category, Column = Recognized Category.


FUTURE  PLANS 

We believe that recognition performance can be improved by
extensions in several directions. One is to divide each vowel's
histograms into multiple subcategories, based on both general
features of the vowel and coarticulation effects. General
categories, useful for the center-frequency histogram, would
include “nasalized,” “Southern accent,” or “fronted.” Left- and
right-context place of articulation, such as “velar,” could be
used to define corresponding histogram subcategories. We also
plan to explore an alternative recognition strategy for
explicitly matching each line-formant against a set of template
line-formants describing a particular phonetic category,
instead of reducing the line to three “independent” points. We
believe that such an approach will better capture the fact that
a given left frequency and a given right frequency are
connected. Finally, we plan to gradually expand the scope of
the recognizer, first to vowels in all contexts and then to
other classes such as semivowels.

ABSTRACT

As part of our goal to better understand the relationship
between the speech signal and the underlying phonemic
representation, we have developed a procedure that describes the
acoustic structure of the signal, and have determined an
acoustically motivated set of broad classes. Acoustic events are
embedded in a multi-level structure, in which information
ranging from coarse to fine is represented in an organized
fashion. An analysis of the acoustic structure, using 500
utterances from 100 different talkers, shows that it captures
over 94% of the acoustic-phonetic events of interest with an
insertion rate of less than 8%. Acoustic classification is
accomplished using a hierarchical clustering technique. Our
evaluations of the results show that with a small number of
clusters, we are able to obtain a robust description of the
speech signal and to provide a meaningful acoustic-phonetic
interpretation.


INTRODUCTION 

The task of phonetic recognition can be stated broadly as
the determination of a mapping of the acoustic signal to a set
of phonological units (e.g., distinctive feature bundles,
phonemes, or syllables) used to represent the lexicon. In order
to perform such a mapping, it is often desirable to first
transform the continuous speech signal into a discrete set of
segments. Typically, this segmentation process is followed by a
labeling process, in which the segments are assigned phonetic
labels. While this procedure is conceptually straightforward,
its implementation has proved to be immensely difficult [4].
Our inability to achieve high-performance phonetic recognition
is largely due to the diversity in the acoustic properties of
speech sounds. Stop consonants, for example, are produced with
abrupt changes in the vocal tract configuration, resulting in
distinct acoustic landmarks. Semivowels, on the other hand, are
produced with considerably slower articulatory movements, and
the associated acoustic transitions are often quite obscure. To
complicate matters further, the acoustic properties of phonemes
change as a function of context, and the nature of such
contextual variation is still poorly understood. As a result,
the development of algorithms to locate and classify these
phonemes-in-context, or allophones, typically requires intense
knowledge engineering.

*This research was supported by DARPA under Contract N00039-85-

We are presently exploring a somewhat different approach
to phonetic recognition in which the traditional phonetic-level
description is bypassed in favor of directly relating the
acoustic realizations to the underlying phonemic forms. Our
approach is motivated by the observation that a description
based on allophones is both incomplete and somewhat arbitrary.
Phoneticians traditionally identify a certain number of
important allophones for a given phoneme based on their
examination of a limited amount of data together with
introspective reasoning. With the availability of a large body
of data [5], we are now in a position to ascertain whether
these categories are acoustically meaningful, and whether
additional categories will emerge. Rather than describing the
acoustic variations in terms of a set of preconceived units,
i.e. allophones, we would like to let the data help us discover
important regularities. In this line of investigation, the
speech signal is transformed into a set of acoustic segments,
and the relationship between these acoustic segments and the
underlying phonemic form is described by a grammar which will
be determined through a set of training data.

This paper describes some recent work in acoustic
segmentation and classification, as part of the development of
a phonetic recognition system. Ideally, we would like our
system to have the following set of properties. The
segmentation algorithm should be able to reliably detect abrupt
acoustic events such as a stop burst and gradual events such as
a vowel to semivowel transition. More importantly, there must
exist a coherent framework in which acoustic changes from
coarse to fine can be expressed. The classification algorithm
should produce an accurate description of the acoustic events.
Phonemes that are acoustically similar should fall into the
same class. If, on the other hand, a phoneme falls into more
than one acoustic class, then the different acoustic
realizations should suggest the presence of important
contextual variations.


ACOUSTIC  SEGMENTATION 

The  purpose  of  our  acoustic  segmentation  is  to  delin¬ 
eate  the  speech  signal  into  segments  that  are  acoustically 
homogeneous.  Realizing  the  need  to  describe  varying  de¬ 
grees  of  acoustic  similarity,  we  have  adopted  a  multi-level 
representation  in  which  segmentations  of  different  sensi¬ 
tivities  are  structured  in  an  organized  fashion. 

Determining Acoustic Segments

The algorithm used to establish acoustic segments is a
simplified version of the one we developed to detect nasal
consonants in continuous speech [2]. This algorithm adopts the
strategy of measuring the similarity of each frame to its near
neighbors. Similarity is computed by measuring the Euclidean
distance between the spectral vector of a given frame and the
two frames 10 ms away. Moving on a frame-by-frame basis from
left to right, the algorithm associates each frame with the
direction, past or future, in which the similarity is greater.
Acoustic boundaries are marked whenever the association
direction switches from past to future. By varying the
parameters of this procedure, we are able to control its
sensitivity in detecting acoustic segments in the speech
signal. We have chosen to operate with a low deletion rate
because mechanisms exist for us to combine segments if
necessary at a later stage.
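
The association rule lends itself to a compact sketch. The code
below is an illustration under stated assumptions rather than
the authors' algorithm: frames are plain lists of floats, the
10 ms neighbor distance is expressed as a frame offset (2 frames
at a 5 ms frame rate), ties are resolved toward the past, and
frames near the edges are compared against the first or last
frame. The function names are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length spectral vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def find_boundaries(frames, offset=2):
    """Return frame indices at which a new acoustic segment begins.

    Each frame is associated with the past or the future, whichever
    neighbor `offset` frames away is closer. A boundary is marked
    where the association switches from past to future.
    """
    last = len(frames) - 1
    directions = []
    for i, frame in enumerate(frames):
        d_past = euclidean(frame, frames[max(i - offset, 0)])
        d_future = euclidean(frame, frames[min(i + offset, last)])
        # Ties go to the past so steady regions create no boundaries.
        directions.append("past" if d_past <= d_future else "future")
    return [i for i in range(1, len(frames))
            if directions[i - 1] == "past" and directions[i] == "future"]

# Six frames of one steady spectrum followed by six of another.
frames = [[0.0]] * 6 + [[1.0]] * 6
boundaries = find_boundaries(frames)
```

In this toy input the only past-to-future switch occurs at frame
6, where the spectrum changes. Shrinking `offset` or smoothing
the frames beforehand are two hypothetical ways to vary the
sensitivity mentioned in the text.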

Signal  Representation 

The algorithms for both acoustic segmentation and
classification use the output of an auditory model proposed by
Seneff [7]. The model incorporates known properties of the
human auditory system, such as critical-band filtering,
half-wave rectification, adaptation, saturation, forward
masking, spontaneous response, and synchrony detection. The
model consists of 40 filters equally spaced on a Bark frequency
scale, spanning a frequency range from 130 to 6,400 Hz. For our
application, we use the output of the filter channels after
they have been processed through a hair-cell/synapse
transduction stage. The envelope of the resulting channel
outputs corresponds to the “mean rate response” of the auditory
nerve fibers. The outputs are represented as a 40-dimensional
feature vector, computed once every 5 ms.

We find this representation desirable for several reasons.
The transduction stage tends to enhance the onsets and offsets
in the critical-band channel outputs. Forward masking will
greatly attenuate many low-amplitude sounds because the output
falls below the spontaneous firing rate of the nerve fibers.
These two effects combine to sharpen acoustic boundaries in the
speech signal. Furthermore, due to the saturation phenomena,
formants in the envelope response appear as broad-band peaks,
obscuring detailed differences among similar sounds, an effect
we believe to be advantageous for grouping similar sounds. In a
series of experiments comparing various signal representations
for acoustic segmentation, we found that, over a wide range of
segmentation sensitivities, the auditory-based representation
consistently produced the least number of insertion and
deletion errors [3].

Multi-level Description

Our experience with acoustic segmentation led us to the
conclusion that there exists no single level of segmental
representation that can adequately describe all the acoustic
events of interest. As a result, we have adopted a multi-level
representation similar to the scale-space proposal by Witkin
[9]. We find this representation attractive because it is able
to capture both coarse and fine information in one uniform
structure. Acoustic-phonetic analysis can then be formulated as
a path finding problem in a highly constrained search space.

The procedure for obtaining a multi-level representation
is similar to that used for finding acoustic segments. First,
the algorithm uses all of the proposed segments as “seed
regions”. Next, each region is associated with either its left
or right neighbor using a similarity measure. When two adjacent
regions associate with each other, they are merged together to
form a single region. Similarity is computed with a weighted
Euclidean distance measure applied to the average spectral
vectors of each region. This new region subsequently associates
itself with one of its neighbors. The merging process continues
until the entire utterance is described by a single acoustic
event. By keeping track of the distance at which two regions
merge into one, the multi-level description can be displayed in
a tree-like fashion as a dendrogram, as illustrated in Figure 1
for the utterance “Coconut cream pie makes a nice dessert.”
From the bottom towards the top of the dendrogram the acoustic
description varies from fine to coarse. The release of the
initial /k/, for example, may be considered to be a single
acoustic event or a combination of two events (release plus
aspiration) depending on the level of detail desired.
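
The merging procedure can be sketched as follows. This is a
simplified illustration, not the authors' code: it uses an
unweighted Euclidean distance (the paper's weights are not
given), merges one mutually associating pair per pass, averages
merged regions in proportion to their durations, and represents
each region as a hypothetical `(start, end, mean_vector)` tuple
with an exclusive end frame.

```python
import math

def distance(a, b):
    """Unweighted Euclidean distance between two mean spectral vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_dendrogram(seeds):
    """Merge adjacent regions bottom-up until one region remains.

    seeds: time-ordered list of (start, end, mean_vector).
    Returns a list of merges (start, boundary, end, merge_distance),
    where `boundary` is the frame at which the two regions met.
    """
    regions = [(s, e, list(v)) for s, e, v in seeds]
    merges = []
    while len(regions) > 1:
        gaps = [distance(regions[i][2], regions[i + 1][2])
                for i in range(len(regions) - 1)]
        for i, g in enumerate(gaps):
            # Region i prefers its right neighbor, and region i + 1
            # prefers its left neighbor: the two associate mutually.
            prefers_right = i == 0 or g <= gaps[i - 1]
            prefers_left = i == len(gaps) - 1 or g <= gaps[i + 1]
            if prefers_right and prefers_left:
                (s1, e1, v1), (s2, e2, v2) = regions[i], regions[i + 1]
                n1, n2 = e1 - s1, e2 - s2
                mean = [(n1 * a + n2 * b) / (n1 + n2)
                        for a, b in zip(v1, v2)]
                merges.append((s1, e1, e2, g))
                regions[i:i + 2] = [(s1, e2, mean)]
                break
    return merges

seeds = [(0, 2, [0.0]), (2, 4, [0.1]), (4, 6, [1.0]), (6, 8, [1.2])]
merges = build_dendrogram(seeds)
```

The recorded merge distance plays the role of the boundary
height in the dendrogram: in this toy input the boundary at
frame 4 survives longest because the regions on either side of
it are the least alike.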

We found this procedure to be more attractive than the
scale-space representation which we and others have
investigated [6,8]. The scale-space procedure produces a
multi-level description by uniformly increasing the scale
through lowpass filtering, without regard to local context. As
a result, at low scales it tends to eliminate short but
distinct acoustic events such as stop releases and flaps. In
contrast, our procedure merges regions using a local similarity
measure. As a result, regions that are acoustically distinct
are typically preserved higher in the dendrogram, regardless of
their duration. Finally, by representing each region by a
single average spectral vector, our procedure is
computationally more efficient.

Evaluation 

We have evaluated the effectiveness of our multi-level
acoustic representation in several ways. First, we developed an
algorithm to automatically find the path through the dendrogram
which best matched a time-aligned phonetic transcription. An
example of such a path is highlighted on the dendrogram in
Figure 1. The boundaries along this path are also marked by
vertical lines in the spectrogram. We then tabulated the
insertion and deletion errors of these paths. Not only should
we expect a small number of insertion and deletion errors, the
errors should also be acoustically reasonable. Next, we
compared the time difference between the boundaries found and
the actual boundaries as provided by the transcriptions.
Finally, we examined whether correct and incorrect boundaries
behave in any reasonable way.

Figure 1: Multi-level Acoustic Segmentation.


The evaluation was carried out using 500 sentences
from the TIMIT database [5]: five sentences each from
100 talkers (69 male and 31 female). These sentences
contained nearly 18,500 phones. The best-path alignment
procedure gave under 6% and 8% deletion and insertion
errors, respectively. Closer examination of the errors reveals
that the deletions mostly involve acoustic transitions that
are not always distinct, such as those between closures and
weak stop releases, between vowels and semivowels,
between nasals and voiced closures, and between stops and
fricatives. In Figure 1, we can see that the boundary
between the stop and the fricative was deleted in the word
‘makes’. We have not yet analyzed the insertions as
exhaustively as the deletions. However, it appears that
approximately half of the insertions occur within the
boundaries of a vowel. In Figure 1 there was an insertion between
the vowel and the fricative in the word ‘nice’.

Analysis of the time difference between the boundaries
found and those provided by the transcription shows
that more than 70% of the boundaries were within 10 ms
of each other, and more than 90% were within 20 ms.

Finally, we compared the boundary heights in the
dendrogram (as measured by the distance at which the region
is merged with one of its neighbors) of valid boundaries
to those of invalid boundaries. This comparison is shown
in Figure 2. The valid boundaries are typically higher,
suggesting that they are more resilient against merging.


Figure  2:  Histogram  of  Boundary  Height. 
ACOUSTIC CLASSIFICATION

Once a signal has been segmented, each region in the
dendrogram is assigned an acoustic label using a pattern
classification procedure described in this section. Ideally,
the classification procedure should group similar speech
sounds into the same category, and separate sounds that
are widely different. While we did not know how many
classes would be appropriate, we suspected that the number
of classes would be small so that the results would be
robust against contextual and extra-linguistic variations.

Hierarchical  Classification 

In order to classify the acoustic segments, we first
determined a set of prototype spectral templates based on
training data. In our case this was accomplished by using
a stepwise-optimal hierarchical clustering procedure [1].
This technique, which is conceptually simple, structures
the data explicitly. In addition, the number of clusters
need not be specified in advance. We used an agglomerative,
or bottom-up, procedure because the merging criterion
is easier to define than the splitting criterion needed
for the divisive procedure. The agglomerative technique
is also computationally less demanding than the divisive
technique. Pilot studies performed using several databases
containing many talkers indicated that the agglomerative
procedure produced relatively stable results, provided
sufficient training data is available.

In the interest of reducing the amount of necessary
computation, we took several steps to reduce the size of
the training sample used in the hierarchical clustering
procedure. First, all of the frames within a dendrogram
region were represented by a single average spectral vector.
Second, rather than using all regions in the dendrogram
for training, we included only those regions that the
path-finding algorithm had used for alignment with the phonetic
transcription. These two steps were found experimentally
to reduce the data by a factor of fifteen, with no noticeable
degradation in clustering performance.

Further data reduction was achieved by merging similar
spectral vectors with an iterative nearest neighbor
procedure, in which a vector is merged into an existing cluster if
the distance between it and the cluster falls below a
threshold. Otherwise, a new cluster is formed with this vector,
and the procedure repeats. In the end, all clusters with
membership of two or fewer are discarded, and the data are
resorted. The value of the threshold was determined
experimentally from a subset of the training data, and was
set to maximize the number of clusters with more than
two members. This final step was found experimentally to
reduce the size of the data by a factor of thirty.
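
A minimal Python sketch of this pre-clustering step follows. It rests on several assumptions that the text does not state: the distance is plain Euclidean, a cluster is represented by the running mean of its members, one pass is made over the data, and the final resorting assigns every vector to its nearest surviving seed. The name precluster and its parameters are hypothetical.

```python
import math

def centroid(total, count):
    """Mean vector from a running sum and a member count."""
    return [s / count for s in total]

def precluster(vectors, threshold, min_size=3):
    """Threshold-based nearest-neighbor pre-clustering (sketch).

    Returns (seeds, assignment): the centroids of clusters with at
    least min_size members, and the index of the nearest seed for
    every input vector. Raises ValueError if no cluster survives.
    """
    sums, counts = [], []
    for v in vectors:
        best, best_d = None, math.inf
        for k in range(len(sums)):
            d = math.dist(v, centroid(sums[k], counts[k]))
            if d < best_d:
                best, best_d = k, d
        if best is not None and best_d < threshold:
            # Merge the vector into the closest existing cluster.
            sums[best] = [s + x for s, x in zip(sums[best], v)]
            counts[best] += 1
        else:
            # Otherwise start a new cluster with this vector.
            sums.append(list(v))
            counts.append(1)
    # Discard clusters with two or fewer members.
    seeds = [centroid(s, c) for s, c in zip(sums, counts) if c >= min_size]
    # Resort all of the data into the surviving seed clusters.
    assignment = [min(range(len(seeds)), key=lambda k: math.dist(v, seeds[k]))
                  for v in vectors]
    return seeds, assignment
```

Sweeping the threshold over a held-out subset and keeping the value that yields the most surviving seeds would mirror the selection criterion described above.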

We used the same 500 TIMIT sentences to train the
classifier. These data comprised over 24 minutes of speech
and contained over 200,000 spectral frames. Restriction to
the time-aligned dendrogram regions reduced the data to
just under 19,000 regions. The pre-clustering procedure on
these regions produced 560 clusters covering nearly 96%
of the original data. All of the data then were resorted
into these 560 seed clusters. The distribution of the
cluster sizes was a well-behaved exponential function with a
mode of 6, median of 16, and an average size of 34. The
hierarchical clustering was then performed on these seed
clusters.

Cluster  Evaluation 

The hierarchical clustering algorithm arranges the
clusters in a tree-like structure in which each node bifurcates at
a different level. The experimenter thus has the freedom to
select the number of clusters and the associated spectral
templates for pattern classification. We have performed
several types of analysis to help us make this decision.

First, the set of clusters should be acoustically robust.
By performing the clustering experiment on several databases
and examining the phonetic contents of the clusters, we
observed that the top three or four levels of the tree
structure are quite stable. For instance, the top two clusters
essentially separate all consonants from vowels. The vowel
cluster subsequently divides based on spectral shapes
corresponding to different corners of the vowel triangle. The
obstruent cluster divides into subgroups such as silence,
nasals, and fricatives. From these observations we decided
that the number of clusters for reliable pattern classification
should not exceed twenty.

We also measured the average amount of distortion
involved in sorting the training set into a given set of
clusters. For a given number of clusters, the set with the
minimum average distortion was designated as the best
representation of the data. Figure 3 illustrates the rate of
decrease in the average distortion as the number of
clusters increases from one to twenty. From this plot we see
that the most significant reductions in the average
distortion occur within approximately the first ten clusters.
Afterwards the rate of decrease levels off to around 1%.
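
The distortion measurement can be illustrated with a short Python sketch. The text does not specify the distortion measure, so Euclidean distance from each vector to its nearest template is assumed here; average_distortion is a hypothetical name.

```python
import math

def average_distortion(vectors, templates):
    """Mean distance from each vector to its nearest cluster template.

    Assumption: distortion is the Euclidean distance to the closest
    template; the measure actually used may differ.
    """
    total = 0.0
    for v in vectors:
        total += min(math.dist(v, t) for t in templates)
    return total / len(vectors)
```

Evaluating this quantity for template sets of increasing size gives a curve of the kind plotted in Figure 3, whose flattening suggests where additional clusters stop paying off.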


Figure  3:  Average  Distortion  versus  Number  of  Clusters. 
Figure 4: Phonetic Hierarchical Structure with Ten Clusters.


We also judged the relative merit of a set of clusters by
examining the distribution of phonetic information within
each set. This was done by performing hierarchical
clustering of all phones using their distribution across the set
of clusters as a feature vector. This procedure is very
helpful in facilitating visualization of the data structure
captured by a set of clusters. A qualitative analysis of these
structures showed that after ten clusters the hierarchical
organization did not change significantly. The structure
for ten clusters is shown in Figure 4.

Finally, we compared the resulting phonetic
distribution for the clusters obtained from the training data to
that from a new set of 500 sentences spoken by 100 new
speakers. We found that the percentage difference for a
given cluster and phoneme is, on the average, around 1%,
suggesting that the results did not change significantly.
Closer examination reveals that the larger differences are
mostly due to sparse data.

Based on the results of these analyses, we concluded
that, by selecting approximately ten clusters, we are able
to capture a large amount of the variability in the data,
as well as a large amount of phonetic information.
Furthermore, because the number of clusters is fairly small,
we are more confident that this result is acoustically
robust in the face of contextual and extra-linguistic effects.
The ten clusters produced by the clustering experiment
are illustrated in Figure 5.


Figure 5: Spectra of Ten Clusters Illustrating the Average and
Deviation.


DISCUSSION 

Acoustic Segmentation

The segmentation algorithm uses relational information
within a local context. As a result, we believe that it is
fairly insensitive to extra-linguistic factors such as recording
conditions, spectral tilt, long term amplitude changes,
and background noise. Because these procedures require
no training of any kind, they are also totally
speaker-independent. In the future, we plan to examine in more
detail the behavior of this algorithm under varying recording
conditions.

The results of our experiment on acoustic segmentation
suggest that a multi-level representation is potentially very
useful. The combined segment insertion and deletion rate
of 14% is much better than the best result we were able
to obtain previously (25%) with a single-level
representation, using essentially the same segmentation algorithm
and signal representation [3]. Analysis of the errors
indicates that most of the deletions occur when the acoustic
change is subtle. When a boundary is inserted, it is often
the case that significant acoustic change exists, such as
within a diphthong or between the frication and aspiration
phases of stop releases. Since our objective is to provide an
accurate acoustic description of the signal, some of these
insertions and deletions perhaps should not be counted as
errors.

The dendrogram produces valid boundaries as well as
invalid ones, and the distributions of the heights for these
two kinds of boundaries are well separated, as shown in
Figure 2. The separation becomes even more pronounced
when the distributions are conditioned on the general
context of the boundary. This type of information lends itself
naturally to a probabilistic framework for finding the best
path through the dendrogram.

Acoustic  Classification 

We are also very encouraged by the results of our
acoustic classification procedure. It appears that we can reliably
assign each segment to one of a small set of acoustic
categories, each having a meaningful phonemic distribution.
In other words, phonemes that are acoustically similar by
and large fall into the same acoustic class. As a result,
we believe that these acoustic labels can help us discover
the relationship between phonemes and their acoustic
realizations. For example, we found that the phoneme / 5/
predominantly falls into acoustic categories F and H shown
in Figure 5. We plan to examine these data more closely
to try to understand the context in which each of these
realizations is preferred.

The phonetic structure we obtained from our results is
also attractive because it provides a totally acoustic
motivation for a set of broad classes. Previous research on
lexical constraints has shown that knowledge of the broad
phonetic categories of the phonemes can be extremely helpful
in eliminating unlikely lexical candidates [10]. We believe
that the set of acoustic labels that we have determined can
potentially aid in the recognition of phonemic classes.

SUMMARY 

In summary, we have reported some initial work with
acoustic segmentation and classification which we believe
can provide a foundation for an eventual phonetic
recognition system. By representing the speech signal with a
multi-level acoustic description, we are able to capture,
and to organize in a meaningful fashion, the majority of
acoustic-phonetic events of interest. Our work with
acoustic classification indicates that, with a small number of
spectral templates, we are able to obtain a robust
description of the speech signal, and also to provide a meaningful
phonetic interpretation. In the future we will combine
these two results and begin to describe in more detail the
relationship between the acoustic signal and the
underlying phonemic representation.
ABSTRACT

The RULE software system is a series of tools that allows one
to construct recognition lexicons. The tools run on a Symbolics
3600 computer, and allow a user to:

(1) Easily construct and test linguistic rules

(2) Automatically compile and apply rules to probabilistic
networks

(3) Graphically display pronunciation networks of words or
sentences

(4) Observe the pronunciation networks as they are modified
by the linguistic rules

(5) Transcribe speech by selecting one of the possible paths
through a pronunciation network

(6) Test if a set of phonological rules can explain observed
forms

A previous paper [Bernstein et al., 1986 DARPA Speech
Recognition Workshop] described an earlier version of these
tools. This paper describes new algorithms that apply
phonological rules to pronunciation networks. Significant
recent developments in RULE include: (1) phonological rules
are applied to probabilistic pronunciation networks, and (2)
generation of interword phonological effects when phonological
rules are applied to individual word models in a lexicon.


1.  INTRODUCTION 

The object of this research is to construct recognition lexicons
that can be used in a speaker-independent continuous-speech
recognition system. Each word in the vocabulary is modeled by
a separate probabilistic pronunciation network. The set of all
pronunciation networks, and the algorithms that determine
which paths of one pronunciation network can follow which
paths of a different pronunciation network, constitute the
recognition lexicon.

A recognition lexicon should consist of probabilistic
pronunciation networks that accurately model the variations in phonetic
pronunciations observed in continuous speech. A pronunciation
network of a particular sentence (1) should contain all allowable
pronunciations of that sentence, (2) should not contain any
pronunciations that are unreasonable, and (3) should contain
probabilities that accurately reflect the true pronunciation
probabilities.

The RULE system was designed to generate pronunciation
networks by applying a set of lexical rules to a baseform
network. In an earlier version of the RULE system (described
in the paper presented at last year's DARPA meeting), when the
phonological rules were applied to the baseform networks of a
single word, the resulting network described the possible
word-internal pronunciations of that single word. When the
phonological rules were applied to the baseform networks of
whole sentences, the resulting network described the variations
in pronunciation of that whole sentence. Since the phonological
rules were applied to a known sentence, particular word context
effects were easily handled. We have extended the RULE
system to apply a set of linguistic rules to individual word
baseforms, and generate a pronunciation network of that word
that represents all significant interword effects. We have also
extended the RULE system to apply probabilistic phonological
rules to a pronunciation network. The new algorithms keep
track of the pronunciation probabilities and the identities of the
rules that generated each pronunciation path. We will
describe these facilities, as well as the algorithms that can be
used to automatically train the probabilities of a set of
phonological rules.

In addition to the above extensions of RULE, other recent
developments include: (1) a set of algorithms to convert a
pronunciation network into a minimum deterministic network,
and (2) an improved interactive graphical display to manipulate
and inspect networks. These algorithms will not be described in
this paper.

2.  DEFINITIONS 

A pronunciation network is a directed graph that RULE
represents as a list of nodes and arcs. Each network has a single
start-node and a single end-node. Each path through the
network (from the start-node to the end-node) represents a
possible pronunciation of a word or sentence. Since the
pronunciation network does not contain loops, each network
contains a finite number of different pronunciations.

The arcs of a network contain all the relevant information about
the allowable pronunciations. Each arc contains the following
information:

(1) Arc-Label: (e.g. "T" "D" "TT" NULL-ARC) All the
arc labels in the final version of a pronunciation network
(after the rules have been applied) correspond to specific
phonetic events. As we have implemented our rule set,
boundary labels such as "&" and "%" and computational
constructs such as NULL-ARC can exist at intermediate
stages of rule application, but are removed from the
network by the application of lexical rules that delete
these arcs.

(2)  Arc-Features:  (e.g.  SYLLABIC  CONSONANTAL 
STOP  NASAL)  These  features  represent  the  linguistic 
properties  of  this  arc. 



(3)  Arc-Probabilities:  These  are  used  to  represent  the 
probability  of  different  pronunciations.  The  sum  of  the 
probability  of  all  arcs  that  leave  each  node  should  be 
equal  to  1.0. 

(4) Arc-Interword-Boundary-Constraints: Path constraints
determine which of the pronunciation paths through the
network are allowable. Whether or not a path is
allowable depends on the preceding and following word
context. Each arc that leaves from the start-node
contains a LEFT-INTERWORD-BOUNDARY-CONSTRAINT.
Each arc that arrives at the end-node
contains a RIGHT-INTERWORD-BOUNDARY-CONSTRAINT.
Other arcs in the network do not
contain path constraints. When the
right-interword-boundary-constraint of the last arc in one network is
COMPATIBLE with the left-interword-boundary-constraint
of the first arc in another network, the
pronunciation paths are compatible, and are allowed to
follow each other when parsing a sentence.

(5) Arc-Rule-Bookkeeping-Information: This information is
used to keep track of which rules generated different
pronunciation paths. This information allows the system
to use hand-transcribed data to compute the probability
of different rules.
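
The definitions above can be made concrete with a small Python sketch. It is an illustration only, not RULE's Lisp data structures: the Arc fields mirror items (1), (2), (3) and (5), the interword boundary constraints are left out for brevity, and all class, field and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Arc:
    src: str                          # node the arc leaves
    dst: str                          # node the arc arrives at
    label: str                        # (1) arc label, e.g. "T"
    prob: float                       # (3) arc probability
    features: frozenset = frozenset() # (2) linguistic features
    rules: tuple = ()                 # (5) rules that generated the arc

def outgoing(arcs, node):
    """All arcs that leave the given node."""
    return [a for a in arcs if a.src == node]

def probabilities_sum_to_one(arcs, tol=1e-9):
    """Check that the arcs leaving each node have probabilities summing to 1.0."""
    nodes = {a.src for a in arcs}
    return all(abs(sum(a.prob for a in outgoing(arcs, n)) - 1.0) <= tol
               for n in nodes)

def pronunciations(arcs, start, end):
    """Yield (labels, probability) for every path from start to end.

    The walk terminates because the network contains no loops.
    """
    def walk(node, labels, prob):
        if node == end:
            yield labels, prob
            return
        for a in outgoing(arcs, node):
            yield from walk(a.dst, labels + (a.label,), prob * a.prob)
    yield from walk(start, (), 1.0)
```

Because the graph is loop-free, the enumeration is finite, and the path probabilities of a well-formed network sum to 1.0.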
FIGURE 1. Sample Pronunciation Network of the word
"HAUPTMAN'S"

[Note. implies that this arc is nasalized; ")" implies that the
previous word context determines whether this pronunciation
path may or may not be taken; "(" implies that the following
word context determines whether this pronunciation path
may or may not be taken; AW' is AW with primary stress.]

We have constructed a dictionary of baseform representations
for 3412 of the words in the TI-AP database. The baseform
representation for each word is a network, and need not be a
single string of symbols (i.e. it could be a multipath network).
Each baseform network begins with an arc whose label ("&" or
"%") represents a word boundary. Some sample baseform
networks are shown in figure 2.
FIGURE 2. Baseform networks for (a) the isolated word
"RUDNICKY", (b) the sentence "SEND A
MESSAGE"


The interactive RULE facility allows a user to write a sequence
of phonological rules. The current rule set consists of 45 rules.
To illustrate the properties of phonological rules, an example is
shown below. Rule V5 is automatically compiled (by the RULE
system) into a list of rule clauses (figure 4). Each rule clause
consists of: (1) a test that must be satisfied to match the clause to
a single arc in the pronunciation network, (2) the action that is to
be taken on the copied arc, and (3) a word boundary identifier
which is used to determine where the rule can be broken up
across word networks. The third clause of rule V5 is an
optional morpheme-boundary test that was automatically
inserted by the rule compiler. The insertion of the optional
morpheme-boundary clause allows rule V5 to operate across
word boundaries. Since this rule clause is optional, the rule
application algorithm may or may not match this clause to a
network arc.

(defrule
  :name                      V5
  :rule-documentation        "W-GLIDE Vowel becomes SCHWA W"
  :core                      (feature W-GLIDE)
  :left-environment          NIL
  :right-environment         (feature-and SYLLABIC VOCALIC)
  :action                    ((replace-phoneme "AX")
                              (insert "W"))
  :rule-type                 MIT
  :copy-matching-arcs
  :rule-probability          3
  :application-order-number  2050
)

FIGURE 3. A sample rule


    TEST TO MATCH ARC                        COPIED ARC ACTION       WORD BOUNDARY ID

1.  (FEATURE W-GLIDE)                        (REPLACE-PHONEME AX)    NONE
2.  NONE                                     (INSERT W)              NONE
3.  (OPTIONAL (FEATURE MORPHEME-BOUNDARY))   (DO-NOTHING)            V5-1
4.  (FEATURE-AND SYLLABIC VOCALIC)           (DO-NOTHING)            NONE

FIGURE 4. The set of rule clauses that rule V5 is compiled
into.


The word boundary identifiers are used (1) to indicate to the rule
application algorithm where the rule clauses can be split across
individual word networks, and (2) in the interword boundary
constraints that determine which pronunciation paths of one
network can follow the pronunciation paths of the previous
network. All rule clauses that can match the initial word
boundary symbol of a pronunciation network are "possible
break points" where a phonological rule can be split across
networks. Since each baseform pronunciation network starts
with a word boundary symbol (either an "&" or a "%"), those
rule clauses that can match these arcs are given a UNIQUE word
boundary identifier. In rule V5, the third clause is given the
unique interword boundary identifier V5-1. Each rule can be
split across pronunciation networks at the location between the
clause with an interword boundary identifier and the previous
clause. In rule V5, this is between the second and third clauses.


Each initial arc and each final arc of a word pronunciation
network contain an interword boundary constraint. These
interword boundary constraints determine whether a
pronunciation path that begins one individual word network is allowed to
follow a pronunciation path that ends another word network.
The pronunciation paths of the two word networks can follow
each other if the two interword boundary constraints (of the last
arc in network-1, and the first arc in network-2) are
COMPATIBLE. Each boundary constraint consists of a list of
single boundary constraints. Each single boundary constraint
contains two lists: a list of OPTIONAL word boundary
identifiers, and a list of OBLIGATORY word boundary identifiers.
For two boundary constraints to be compatible, each obligatory
word boundary identifier in one interword constraint must be
contained in the list of either the obligatory or the optional word
boundary identifiers of the other interword constraint.
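
The compatibility test can be sketched in Python as follows. Two points are assumptions rather than statements from the text: the subset check is applied in both directions, and two boundary constraints (each a list of single constraints) are taken to be compatible when at least one pair of their single constraints is compatible. All names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SingleConstraint:
    obligatory: frozenset = frozenset()  # OBLIGATORY word boundary identifiers
    optional: frozenset = frozenset()    # OPTIONAL word boundary identifiers

def singles_compatible(a, b):
    """Each side's obligatory identifiers must appear on the other side,
    either as obligatory or as optional identifiers."""
    return (a.obligatory <= (b.obligatory | b.optional)
            and b.obligatory <= (a.obligatory | a.optional))

def constraints_compatible(right_constraint, left_constraint):
    """Compare the constraint on the last arc of network-1 with the
    constraint on the first arc of network-2.

    Assumption: one compatible pair of single constraints is enough.
    """
    return any(singles_compatible(r, l)
               for r in right_constraint
               for l in left_constraint)
```

For example, a final arc that obligatorily carries the identifier V5-1 can be followed by an initial arc that lists V5-1 as optional, but not by one that does not mention it at all.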

3.  PROBABILITIES 

The pronunciation network of a word initially consists of the
baseform network. Each of the phonological rules is applied
sequentially to each baseform network. To apply a rule to the
network, the application algorithm searches the network for a
series of arcs that satisfy the list of rule clauses. When a series
of arcs is found that matches the rule clauses, the matching arcs
are copied, and the (1) labels, (2) features, (3) interword
boundary constraints, and (4) path probability of these copied
arcs are modified. The path probability of the matching arc
sequence is multiplied by the probability of the rule. The new
arcs represent an alternate pronunciation path. Since the rules
are applied sequentially, the pronunciation paths generated by
previous rules may be used to match the rule clauses of
following phonological rules.

To illustrate how a rule is applied to a network, we can look at
figures 5 and 6. The purpose of sample rule RULE-1 is to
allow the alternate pronunciation "E" to a series of arcs ("A"
"C"). The 4 stages of rule application are illustrated in figure 6:
(1) the network is exhaustively searched for all series of arcs
that match the rule; for each of those series of arcs that match,
the following steps are taken: (2) the matching arcs of the
network are broken out into a separate linear path, (3) the
matching arcs are copied and subsequently modified by the
actions of the rule clauses, (4) the network is converted into a
minimum deterministic graph. The algorithm is described in
much greater detail in Appendix 1.

(defrule
  :name                      RULE-1
  :rule-documentation        "illustrative example"
  :core                      ((phoneme "A") (phoneme "C"))
  :left-environment          NIL
  :right-environment         NIL
  :action                    ((replace-phoneme "E")
                              (delete-phoneme))
  :rule-type                 TEST
  :copy-matching-arcs        T
  :rule-probability          .5
  :application-order-number  1
)


FIGURE  5.  A  sample  rule  that  will  be  used  to  demonstrate 
how  rules  are  applied  to  the  pronunciation 
network. 


FIGURE 6. The sample rule in figure 5 is applied to a
network.

6a: The original network.

6b: The clauses in RULE-1 have been matched against the
network, and the matching path ("A" "C") has been
expanded into a linear path.

6c: The matching path ("A" "C") is copied, and the
appropriate actions are taken on these arcs. The network
probabilities are modified, and the rule bookkeeping
indicates which rule generated this new path.

6d: The network is converted into a minimum deterministic
graph.

After the rule clauses have been successfully matched to a
sequence of arcs in the network, the network is expanded into a
series of linear arc sequences (see figure 6, top right). This
network expansion is necessary so that the pronunciation
probabilities can be modified correctly. The original
matched arcs are then copied, and modified by the actions
specified in the rule. The path probability of the newly modified
arc sequence is equal to the probability of the original matched
arc sequence multiplied by the rule probability. The path
probability of the original matched arc sequence is multiplied by
(1.0 - rule probability). Finally, the network is converted into a
minimum deterministic graph, maintaining the correct path
probabilities and rule bookkeeping information. This algorithm
is described in more detail in Appendix 1.
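
The probability bookkeeping for a single rule application can be written out as a short Python sketch. The function name and the input values in the usage note are illustrative assumptions; only the two multiplications come from the description above.

```python
def split_path_probability(path_prob, rule_prob):
    """Divide a matched path's probability between the new path created
    by a rule and the original path.

    Returns (new_path_prob, original_path_prob), where the new path
    receives path_prob * rule_prob and the original path keeps
    path_prob * (1.0 - rule_prob).
    """
    if not 0.0 <= rule_prob <= 1.0:
        raise ValueError("rule probability must lie between 0 and 1")
    new_path = path_prob * rule_prob
    original = path_prob * (1.0 - rule_prob)
    return new_path, original
```

Taking arc probabilities of .5 for "A" and .5 for "C" as illustrative inputs, the matched path has probability .25; applying a rule with probability .5 gives the new "E" path .125 and leaves .125 on the original path, so the total probability mass is unchanged.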

To compute the probability of each phonological rule, a database
of hand transcribed speech is necessary. For each utterance in
the database, a set of phonological rules is applied to a sentence
baseform network to create a pronunciation network for that
sentence. Using the hand transcribed data, the pronunciation
path through the network is computed. Beginning at the
start node, RULE traverses the pronunciation network along the
observed path. At each node along this path, RULE can
compute a list of all the different phonological rules that
generated any of the arcs that leave from the node. For each of
the rules that are in this list, we increment their
POSSIBLY-APPLIED-COUNT by 1. We then increment the
ACTUALLY-APPLIED-COUNT of all the phonological rules that were
used to generate the traversed arc. When we have finished
processing all the utterances in the database, the probability of
each phonological rule is equal to
(/ ACTUALLY-APPLIED-COUNT POSSIBLY-APPLIED-COUNT), that is, the
ratio of the two counts.
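
The counting procedure can be sketched as follows. This is a minimal
illustration under assumed data structures, not the RULE code: each
observed path is represented as a list of steps, and each step is a
pair (rules that generated any arc leaving the node, rules that
generated the traversed arc). The rule name "F1" is hypothetical.

```python
from collections import Counter


def train_rule_probabilities(observed_paths):
    """Estimate rule probabilities from observed pronunciation paths.

    observed_paths: iterable of paths; each path is a list of
    (candidate_rules, used_rules) pairs, one pair per node visited.
    """
    possibly_applied = Counter()
    actually_applied = Counter()
    for path in observed_paths:
        for candidate_rules, used_rules in path:
            # Every rule that generated some arc leaving this node
            # could have applied here.
            possibly_applied.update(set(candidate_rules))
            # Only the rules behind the traversed arc actually applied.
            actually_applied.update(set(used_rules))
    return {rule: actually_applied[rule] / possibly_applied[rule]
            for rule in possibly_applied}


# Two toy utterances (hypothetical data).
paths = [
    [({"V5", "F1"}, {"V5"}), ({"F1"}, set())],
    [({"V5"}, set())],
]
probs = train_rule_probabilities(paths)
```

In this toy data, V5 could have applied twice and did apply once, while
F1 could have applied twice and never did.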

4.  INTERWORD  RULE  APPLICATION 

To allow phonological rules to apply across words when each
word is stored in a lexicon as a separate virtual network, two
major changes were made. These changes consisted of (1) a
new rule application strategy, and (2) the addition of an
interword boundary constraint that determined which
pronunciations of a word were possible, based on the
previous/following word context.

Each rule consists of a series of rule clauses. During rule
application, each rule clause needs to be matched against an arc
in the network. In order to deal with interword effects, the new
rule application algorithm needs to allow partial sequences of
rule clauses to be matched against a series of arcs in the
network, with the remaining clauses to be applied to the
previous/following word. Since each baseform pronunciation
network of an individual word starts with a word boundary
symbol (either an [...] or an [...]), only rule clauses that can
match these arcs can split a rule across a network boundary.
Therefore, each rule clause that can match these arcs is given a
unique word boundary identifier. Rule clauses that cannot
match a word boundary symbol cannot be locations where a
rule is broken up across word boundaries. Each rule can be
split across pronunciation networks at the location between the
clause with an interword boundary identifier and the previous
clause. In rule V5, this is between the second and third clauses.
An example of how rule V5 (see figure 3) can be broken up
across word boundaries is shown in figure 7.


    TEST TO MATCH ARC         COPIED ARC ACTION       WORD BOUNDARY ID

1.  (FEATURE W-GLIDE)         (REPLACE-PHONEME AX)    NONE

2.  NONE                      (INSERT W)              NONE

            *** BREAK ACROSS NETWORKS HERE ***

3.  (OPTIONAL (FEATURE
      MORPHEME-BOUNDARY))     (DO-NOTHING)            V5-1

4.  (FEATURE-AND
      SYLLABIC VOCALIC)       (DO-NOTHING)            NONE


FIGURE 7. How the clauses of rule V5 can be split across
word boundaries. To apply this rule across network boundaries,
the first clause matches the last arc in network-1, while
the third clause matches the first arc in network-2. Since the
second clause is an insertion, it does not need to match
anything in either network. The changes in pronunciation
(both the AX and the W) will be associated with network-1.

The rule application algorithm was modified to allow partial
sequences of rule clauses to be applied to networks. When the
rule application algorithm (in appendix 1) reaches the end of a
network, and also encounters an interword boundary identifier
on the next clause to match the network, it allows the rule to be
split across networks. An example of rule V5 being split across
network boundaries is shown in figure 8.




FIGURE 8. Top: The baseform network for the word
"HOW". Bottom: After rule V5 has been applied to the
network. The new pronunciation path that has been added is
highlighted. The "(" of the "W(" means that this pronunciation
path can only be taken if the following word satisfies
certain conditions, in this case that the beginning of the
following word network has a syllabic vocalic arc just inside
the word boundary as specified by the last two clauses of
rule V5.

The new pronunciation path ["AX", "W"] of the word "HOW"
in figure 8 may only be traversed if the following word satisfies
the interword boundary constraint. This is because the new
pronunciation path is conditional on the features of the word that
follows it. The new pronunciation of the word "HOW" may
only be used if the next pronunciation network starts with a
morpheme boundary followed by an arc that is both syllabic and
vocalic. This interword boundary constraint indicates where the
rules were split across network boundaries, and which
word-edge paths are consistent with each other.
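
As a rough illustration of such a constraint check, the sketch below
tests whether the start of a following word's network satisfies the
last two clauses of rule V5. The representation of arcs as sets of
feature names and the function name are assumptions made for this
example; the morpheme boundary is treated as optional, following the
OPTIONAL marking of clause 3 in figure 7.

```python
def satisfies_v5_constraint(next_word_arcs):
    """Check the start of the following word against clauses 3 and 4 of V5.

    next_word_arcs: list of feature sets, one per arc, in order from
    the start of the following word's pronunciation network.
    """
    arcs = list(next_word_arcs)
    # Clause 3: an optional morpheme boundary, skipped if present.
    if arcs and "MORPHEME-BOUNDARY" in arcs[0]:
        arcs = arcs[1:]
    # Clause 4: the next arc must be both syllabic and vocalic.
    return bool(arcs) and {"SYLLABIC", "VOCALIC"} <= arcs[0]


# Hypothetical following words: one vowel-initial, one consonant-initial.
vowel_initial = [{"MORPHEME-BOUNDARY"}, {"SYLLABIC", "VOCALIC"}]
consonant_initial = [{"MORPHEME-BOUNDARY"}, {"CONSONANTAL"}]
```

Under these assumptions the conditional path of "HOW" would be allowed
before the first word and blocked before the second.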

5. BASEFORMS AND PHONOLOGICAL RULES
IMPLEMENTED IN RULE

For CMU's Electronic Mail Task, SRI implemented a
recognition lexicon that recognizes multiple pronunciations of
most vocabulary words. The variant pronunciations (e.g.
"decision" with or without a tense first vowel, or "capacity"
with a flap or an aspirated [t]) can be directly represented in the
baseform list or they can be derived by rule from single
baseforms. In SRI's work thus far, we have maintained an
intuitive but principled split between irregular or lexical
variants and general, regular, rule-governed variation. By this
criterion, the forms of "exit" with voiced [gz] or voiceless [ks]
are explicit in the baseform list, while the flap and [t] forms of
"capacity" are handled by rule. This kind of split is possible in
RULE but not required.

As of this writing, the rule set that SRI has implemented
consists of about 45 rules that are separated into eight groups.
The rules within a group apply more or less in parallel, while the
members of lower numbered groups apply before members of
higher numbered groups. The groups are:

0: Expansions -- Convenient redundancies like insertion of
silences and glottal stops as appropriate at word boundaries.

1: Lexical and Dialectal Variants -- Regular dialectal and free
variant forms such as /w/ or /wh/ in "where" etc., or
initial-syllable tense-lax alternations in words like "demand"
and "deny".

2: Syllabic Nucleus Core -- Deletion of unstressed initial
vowels and the re-coding of diphthongs into schwa-glide
sequences.

3: H and Glide or Liquid Core -- Deletions of h and l in
certain environments.

4: Nasal Core -- Assimilation of certain nasals.

5: Fricative Core -- Epenthetic stop insertion, and several
assimilations.

6: Plosive Core -- Several kinds of lenition, including deletions,
assimilations and cluster reduction.

7: Phonetic Alternates and Patches -- A catchall for covering
regular (non-phonological) correspondences created by the
logic of the Acoustic-Phonetics module, and for the deletion
of diacritics such as boundary marks.

A final note: There are several types of phonological rules that
are often used in linguistic descriptions and would be convenient
for building a recognition lexicon, but that are not easily
implemented in the RULE system. One example is "alpha" rule
notation, in which a variable, alpha, is bound in one clause of the
rule and referred to in another. Alpha rules are useful in handling
assimilations and geminate reductions. Another example is the use
of abstract phones in cross-dialectal baseforms; that is, phones
that are not realized in either dialect but are convenient ways to
represent a regular correspondence between forms. Abstract
phones and alpha rules can be done in RULE but they are not
natural to its formalism.

6.  SUMMARY 

The  RULE  software  system  is  a  set  of  tools  that  allows  one  to 
construct  recognition  lexicons.  This  paper  describes  new 
algorithms  that  apply  phonological  rules  to  pronunciation 
networks.  The  novel  aspects  of  the  algorithm  involve:  (1)  rule 
application  to  probabilistic  pronunciation  networks,  and  (2) 
generation  of  interword  phonological  effects  when  phonological 
rules  are  applied  to  individual  word  models.  With  a  hand 
transcribed  database,  we  can  automatically  train  the  probability 
of  each  of  the  phonological  rules,  and  use  these  phonological 
rules  to  create  accurate  probabilistic  pronunciation  networks  for 
each  word  in  the  vocabulary. 


APPENDIX 1: ALGORITHM FOR APPLYING A SET OF RULES TO
THE NETWORK

FUNCTION: APPLY-SET-OF-RULES-TO-NETWORK
          (list-of-ordered-rules, network)

Get the first rule of the list-of-ordered-rules.

Determine rules to apply in parallel with this rule ->
PARALLEL-RULE-LIST

REMAINING-RULE-LIST contains rules that remain.

RULE-APPLICATION-PASS # 1:

  Loop for each rule in parallel-rule-list
    Loop for each node in the network.

      If this is the start-node of the network,

      THEN

        Loop for each rule clause that has a word
        boundary identifier

          CALL: MATCH-CLAUSE-LIST-TO-NETWORK
                (remaining-rule-clause-list, node)

      ELSE

        CALL: MATCH-CLAUSE-LIST-TO-NETWORK
              (rule-clause-list, node)

Collect the arc sequences successfully matched by rule clauses. Expand
network so that all matched arc sequences are separated out.

RULE-APPLICATION-PASS # 2: [same loops and calls as pass #1]

Convert the network into a minimum deterministic graph.

If there are any rules remaining,

THEN CALL: APPLY-SET-OF-RULES-TO-NETWORK
           (remaining-rule-list, network)

FUNCTION: MATCH-CLAUSE-LIST-TO-NETWORK
          (rule-clause-list, node)

If (OR [ rule-clause-list is empty ]
       (AND [ node is the end-node of the network ]
            [ the next rule clause that has an arc test
              also has a word boundary identifier ]))

THEN

  CALL: MODIFY-NETWORK-BY-APPLYING-RULE

ELSE

  Loop for arcs that leave this node

    If arc satisfies first remaining rule-clause,

    THEN

      CALL: MODIFY-NETWORK-BY-APPLYING-RULE
            ((cdr rule-clause-list), (to-node arc))

    ELSE

      If rule-clause is an optional clause,

      THEN

        CALL: MODIFY-NETWORK-BY-APPLYING-RULE
              ((cdr rule-clause-list), node)


FUNCTION: MODIFY-NETWORK-BY-APPLYING-RULE
          (network, rule-application-pass-number,
           unmatched-left-rule-clauses, matched-rule-clauses,
           unmatched-right-rule-clauses)

If first pass, then collect matched arc into a temp data structure. If this is
the second pass of the rule application,

THEN

  If there are no unmatched left or right rule clauses,
  (rule does not apply across network boundaries.)

  THEN

    If the rule copies the matching arcs,
    then copy the arcs, modify the probability and
    bookkeeping of this path.

    Apply action to each copied arc (typically modify
    arc label or features).

  ELSE

    (there are some unmatched rule clauses,
     rule that applies across network boundaries.)

    Loop through matched rule clauses.

      If clause action modifies the network,

      THEN

        If the rule copies the matching arcs, then copy the
        arcs, modify the probability and bookkeeping of
        this path.

        Apply action to each copied arc (typically
        modify arc label or features).

        Modify interword constraints of copied arcs to contain
        obligatory interword identifier.

      ELSE

        Modify interword constraints of the matched arcs to
        contain optional interword identifier.
ABSTRACT

In the past year SRI has undertaken a series of empirical studies of
phonological variation. The goal has been to find better lexical
representations of the structure and variation of real speech, in order
to provide speaker independence in speech recognition. Results from
these studies indicate that knowledge of probabilities of occurrence
of allophonic forms, co-occurrence of allophonic forms, and speaker
pronunciation groups can be used to lower lexical entropy (i.e.,
improve predictive ability of lexical models), and possibly therefore
achieve rapid initial adaptation to a new speaker as well as ongoing
adaptation to a single speaker.


INTRODUCTION 

As the number of words in the lexicon grows, the speech recognition
problem gets more difficult. In a similar way, as more possible
pronunciations for each word are included in the lexicon, the
recognition problem gets more difficult because there are more
competing hypotheses and there can be more overlap between the
representations of similar words.

One important goal of a lexical representation is to maximize
coverage of the pronunciations the system will have to deal with, while
minimizing overcoverage. Overcoverage adds unnecessary difficulty
to the recognition problem. One way to maximize coverage while
minimizing overcoverage is to explicitly represent all possible
pronunciations of each vocabulary word as a network of allophones.
An example of such a network is shown in figure 1 for the word
"water". This network represents eight possible pronunciations,
some of which are fairly common (e.g., [W AO DX ER]), and others
somewhat rare (e.g., [W AA T AX]). Experience suggests that, to
assure coverage, it will be necessary to include many pronunciations
for each word, including those which happen relatively rarely.

In reality, speech is more highly organized. There is more predictive
knowledge available than in a model that simply represents
independent equiprobable choices with no interaction or influence
between different parts of a model and with no ability to use
information from other parts of an utterance or previous utterances by
the current speaker. In current systems which use allophonic models,
each node represents an independent set of equiprobable choices.

1 This research was sponsored by Defense Advanced Research Projects Agency
Contract N00039-85-C-0302. The views and conclusions contained in this document
are those of the authors and should not be interpreted as representing the official
policies, either expressed or implied, of the Defense Advanced Research Projects
Agency or the US Government.


The goal of the research described in this paper is to explore ways in
which a lexical representation can better reflect the structure of real
speech data, so that the representation will have more predictive
power, and thus improve recognition accuracy. A better
understanding of the issues involved may lead to methods for rapid
adaptation to a new speaker, as well as ongoing adaptation to a single
speaker during a single session.

In order to explore these issues, we chose to model (as a single
utterance) a pair of sentences containing 21 words for which we had a
large data set. The patterns of variation found for this 21-word
microcosm should indicate what kinds of structures will be needed in
a larger lexicon. The data used were transcriptions of the two
dialect sentences for the 630 speakers in the TI-AP database.

We have performed a series of four studies that explored four types
of phonological structure, and ways of representing this structure in
a lexicon. In the first study, we simply looked at the gain in
predictive ability of a phonological model which incorporates
knowledge of the probabilities of the various possible word
pronunciations. The second study explored the co-occurrence of
allophonic forms, and ways in which knowledge of these co-occurrences
can be automatically compiled into a phonological model. The third
study explored the possibility of grouping speakers into a small number
of pronunciation clusters, and looked for demographic and other
predictors of these pronunciation clusters. The fourth study was
designed to compare intra-speaker variation to the variation within the
pronunciation clusters defined by the third study.

To evaluate our data, and compare representations, we used entropy
as a measure of the predictive power of a representation, or
difficulty of the recognition task given a particular representation.
The entropy of a representation, developed from or "trained" on
some large set of data, reflects both how well the representation
captures significant structure in the data and how much predictive
power is gained by modelling this structure.

The four studies are described in the following four sections,
followed by a general discussion and conclusions.

PRONUNCIATION  PROBABILITIES 

The goal of the first study was to determine how much speech
recognition accuracy could be improved by incorporating knowledge of
pronunciation probabilities into a phonological language model. An
important goal of any lexical representation is to provide coverage
of the pronunciations that the system will have to deal with,
including relatively rare pronunciations. This makes the recognition
problem more difficult because there are more competing hypotheses
and can be more overlap between word models. One way to deal
with this problem is to include probabilities for pronunciations in
the lexical model. In this way, including somewhat rare
pronunciations will increase coverage without hurting performance. It
will allow recognition of these unusual pronunciations, avoiding
confusion with other more common pronunciations of similar words. For
example, consider the allophone string

[DH AX B IH G W AA DX AX B IH L Z]

This string contains, as a substring, the sequence [W AA DX AX],
which corresponds to one of the paths through the network for the
word "water" shown in figure 1. This is a relatively infrequent
pronunciation of the word "water". An alternative hypothesis for
this same substring could be the pair of words "wad of", for which
this pronunciation is relatively common. (Suggesting the phrase
"The big wad of bills" rather than "The big water bills".)
Appropriate probabilities associated with these pronunciations could
allow a system to make a more intelligent choice. Such a model should
help recognition accuracy significantly, provided that the probabilities
used are accurate for the domain in which the system will be used,
and especially if the probability distributions are significantly
different from the default equi-probable models.




Figure  1.  Allophone  network  for  the  word 
"water". 

The data used in this study were transcriptions of the two dialect
sentences for the 630 speakers in the TI-AP database. Originally,
the allophonic forms used for each of 58 phonemes in the two test
sentences as produced by the 630 speakers were transcribed by
Margaret Kahn, Jared Bernstein, or Gay Baldwin. The transcriptions
were done carefully using a high fidelity interactive waveform editor
with a convenient means to mark and play regions in a high
resolution image of the waveform. Spectrographic and other analytic
displays were also easily available, though most of the work was
done by ear and by visual inspection of the waveforms.
Subsequently, a subset of 18 of these segments was chosen which we felt
we could transcribe accurately and consistently; the 18 are
disproportionately consonantal. This subset of 18 segments in the
original 630 transcriptions was then re-checked and corrected by one
individual (Gay Baldwin). The transcriptions of these 18
segments/utterance were compared to a subset of 156 speakers
whose sentences had been independently transcribed at MIT as part
of a related project. For this subset of 156 speakers, the number of
transcription disagreements between SRI and MIT was about 5-10%
for a typical phoneme.

Figure 2a shows the two dialect sentences, indicating the segments
included in this study, along with the distributions of allophones
found for each of these phonemes. Figure 2b shows the 18-node
allophone model used to represent the possible pronunciations.
Among these 18 phonemes, at the level of transcription we used,
there are 14 two-way splits, two three-way splits, and a six and a
seven-way split. The distributions vary from a 1%-99% split for
canonical /t/ vs. flap in "water" to a 60%-40% split for the
affricated vs. non-affricated /dy/ juncture in "had-your" to a
47%-53% split for a glottal gesture at the beginning of "oily".


Figure 2.
a) Observed percentages of allophonic forms.
b) 18-node utterance model.

The seven-way split for the juncture in "suit in" is the most
unpredictable. The potentially variable events are the burst of the
/t/, the occurrence of a glottal onset to "in", and the presence or
absence of the vowel in "in". A third of the readers produced a very
clear form that exhibited a t-burst and a glottal stop or
glottalization at the onset of the /I/ in "in". A quarter of the
readers flapped or produced a short /d/ into the vowel in "in".
Nineteen percent of the utterances showed no burst for the /t/ but a
glottal gesture into the /I/, while 18% showed the same burstless /t/
with glottal gesture, but released the gesture directly into the nasal,
deleting the /I/. Three percent of the readers (19 people) released
the /t/ with a burst right into the /I/; 3 speakers (0.6%) had a clear
t-burst, but the glottal gesture goes right into the nasal with the
/I/ deleted. One speaker (0.2% of the sample) produced the "suit in"
juncture as a /d/ with a velic release into the nasal (as in a word
like "sudden"). The distributions are surprising only as reminders of
how little quantitative data on the relative frequency of occurrence
of allophones is available. What experienced phonetician could have
estimated the proportion of these forms in reading? It's no wonder
that speech recognition lexicons would have whatever allophonic
options they allow unspecified as to relative likelihood.

The approach used in this study was to compute the probabilities of
each of the transitions in the 18-node allophone model (figure 2b)
from a large database of speech. The entropy of this model was
computed and compared to the entropy of a similar model without
probabilities estimated from data, in which case all transitions from
a node are considered equiprobable. Information theoretic entropy,
H, of an arbitrary string, S, in the language was computed as

$$H(S) = -\sum_{n} \sum_{t} P(t) \log_2 P(t) \qquad (1)$$

where n ranged over all of the nodes in the utterance model, t
ranged over all of the transitions from the current node, and P(t) is
the probability of transition t. This is the same as:

$$H(S) = -\sum_{s} P(s) \log_2 P(s) \qquad (2)$$

where s ranges over all of the strings in the language [McEliece
1977].
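
The equivalence of the node-wise and string-wise entropy computations
can be checked numerically on a small model. The sketch below assumes a
linear model in which every node is visited exactly once and choices at
different nodes are independent, as in the 18-node model; the
three-node toy model and its probabilities are hypothetical.

```python
import math
from itertools import product


def node_entropy(probs):
    """Entropy in bits of the transition choice at a single node."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def entropy_by_nodes(nodes):
    """Equation (1): sum the transition entropies over all nodes."""
    return sum(node_entropy(probs) for probs in nodes)


def entropy_by_strings(nodes):
    """Equation (2): enumerate every string and sum -P(s) log2 P(s)."""
    total = 0.0
    for choice in product(*nodes):
        p = math.prod(choice)  # independent choices at each node
        if p > 0:
            total -= p * math.log2(p)
    return total


# Hypothetical model: two two-way splits and one three-way split.
toy_model = [[0.5, 0.5], [0.9, 0.1], [0.6, 0.3, 0.1]]
h_nodes = entropy_by_nodes(toy_model)
h_strings = entropy_by_strings(toy_model)
```

The node-wise form needs only one pass over the transitions, while the
string-wise form grows with the product of the branching factors, which
is why the first form is the practical one for larger models.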
The entropy measured for the model with equi-probable transitions
is 22.8 bits, and for the model with empirically estimated
probabilities is 13.0 bits. This represents an increase of 42.4% in
predictive ability (knowledge or source of constraint) for the model
with trained probabilities. Presumably, this further constraint should
translate into improved recognition accuracy.

CO-OCCURRENCE  OF  ALLOPHONIC  FORMS 

The goal of the next study was to explore co-occurrence
relationships in allophonic variation. A co-occurrence relationship is
one in which the probability of the occurrence of a particular variant
is conditioned on the presence or absence of some other variant in
another part of the utterance. Knowledge of such co-occurrence
relationships can be used to increase predictive power about
allophonic variation.

The data used in this study were the same as those used in the
previous study, except that the realisation of /k/ in "dark" was
excluded, since we decided we had insufficient confidence in our
transcriptions of that phoneme. All possible pairings of the remaining
17 phonemes (136 pairs) were tested for co-occurrence relationships.
The two examples in figure 3 demonstrate the technique. For each
pair of segments, counts of all combinations of variants for the two
forms were entered into a matrix. Chi-square tests were performed
on these matrices at the 97.5% confidence level.

The example in figure 3a illustrates the analysis of glottal (or no
glottal) at the beginning of "all" and "oily". The table shows that,
of the 630 speakers, 230 used a glottal gesture (either a full glottal
stop or a weaker gesture seen as several irregular glottal periods,
both symbolised here as [?]) at the beginning of both "all" and
"oily". One hundred eighty-four speakers didn't use [?] before either
word, 65 speakers put [?] just on "oily", and 151 just on "all". The
chi-square is significant at the 97.5% level, indicating that this
pattern of co-occurrence of glottals at the beginning of "all" and
"oily" is rather unlikely to happen by chance if we assume that the two
events are independent. In other words, speakers who used [?]
before "all" were more likely to use [?] before "oily" as well.
Similarly, if [?] was omitted before "all", it was less likely to be
found before "oily". This case of co-occurrence is not surprising,
because both forms could be considered to result from the same
phonological rule.
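
A test of this kind can be sketched with a plain Pearson chi-square
statistic on a 2x2 table. This is an illustrative computation, not the
authors' procedure: the cell counts follow the "all"/"oily" example,
with the "oily"-only cell taken to be 65 so that the four cells sum to
630 (an assumption), and 5.024 is the standard chi-square critical
value for one degree of freedom at the 97.5% level.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variants were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


CRITICAL_975_DF1 = 5.024  # chi-square critical value, 1 d.f., 97.5%

# Rows: [?] before "all" (yes, no); columns: [?] before "oily" (yes, no).
all_oily = [[230, 151],
            [65, 184]]
stat = chi_square(all_oily)
dependent = stat > CRITICAL_975_DF1
```

With these counts the statistic lies well above the critical value, in
line with the non-independence reported for this pair.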
Figure 3. Co-occurrence examples:

a) onsets in "all" and "oily".

b) onsets for "an" and "to".
The co-occurrence matrix in figure 3b shows a dependent
relationship between forms that are phonologically heterogeneous. In
this case speakers who use [t] rather than flap in "to" show a strong
tendency to use (rather than omit) [?] before "an". This might be
interpreted as evidence for a higher level fast-speech (or lax style)
"macro-rule", which increases the likelihood of several types of
phonological rules. One goal of our work is to establish a method by
which such functional rule groups can be found (or dismissed). For
now we just present preliminary data that show non-independence
between pairs of forms over this sample of utterances.

Figure 4 shows which of the 136 possible co-occurrences actually had
chi-squared values that indicated non-independence. The confidence
level for the chi-squared value was 97.5%, meaning that of the 136
chi-squares calculated, one could expect about four artifactually
significant values. In fact, 37 pairs showed
non-independence between co-occurring forms. Of these 37, about
15 involve pairs that have a clear phonological relation (r-lessness in
"your", "dark", "water"; [?] in "all", "an", "oily"; flapping in
"water", "to", "don't ask", "suit in"; etc.). Most of the remainder
show dependencies between variants in more remotely related
phonological contexts. The number of dependencies is obviously
considerable, and suggests that macro-level relationships (dialect
region, utterance speed, style, sex-linked variation) are pervasive
enough to be useful in improving predictions of forms for automatic
speech recognition.
Figure 4. Significant chi-squares for
form pairs.
There could be significant advantage in finding a way to compile
knowledge about the co-occurrence of allophonic forms into the
phonological model. This would allow a form of within-utterance
adaptation to take place automatically. An example of how this might
be done is shown in figure 5. Figure 5a shows a probabilistic
language model for a language consisting of strings of two symbols,
the first symbol being "A" half the time and "B" the other half of
the time, and the second symbol evenly divided between "C" and
"D". There is additional structure to this language, in the form of
co-occurrence. When the first symbol is "A", the probability is 90%
that the second symbol will be "C", and 10% that it will be "D".
When the first symbol is "B", the distribution is reversed. The
entropy of such a model can be calculated as 1.47 bits. When a
model representing the same language is configured as in figure 5b,
without representation of the co-occurrence, the entropy is two bits,
one bit for the choice at each node. Configuring the model to reflect
co-occurrence knowledge has resulted in more than 25% lower
entropy. Clearly, the model in 5a can do a better job of predicting
incoming strings in the language than that in 5b.
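
The two entropy figures quoted for this toy language can be reproduced
directly. The sketch below is only a check of the arithmetic; the
variable names are chosen for this example, and the joint probabilities
follow from the stated 50/50 first symbol and 90/10 conditional split.

```python
import math


def entropy(probs):
    """Entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Model with co-occurrence: joint distribution over two-symbol strings.
joint = {
    ("A", "C"): 0.5 * 0.9,
    ("A", "D"): 0.5 * 0.1,
    ("B", "C"): 0.5 * 0.1,
    ("B", "D"): 0.5 * 0.9,
}
h_with_cooccurrence = entropy(joint.values())

# Model without co-occurrence: two independent 50/50 choices.
h_without_cooccurrence = entropy([0.5, 0.5]) + entropy([0.5, 0.5])

reduction = 1.0 - h_with_cooccurrence / h_without_cooccurrence
```

Both models assign the same marginal probabilities to each symbol; the
entire reduction comes from representing the dependency between the
two positions.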


Figure  5.  Alternative  models  for  a  simple 
language. 


We performed a clustering study to determine whether we could
compile co-occurrence knowledge into a phonological model, hence
lowering entropy, using an automatic procedure. Considering each
sentence pair from a speaker to constitute one utterance, we
clustered the 630 utterances into the lowest entropy groups we could
find. Each group could then be used to estimate allophone
probabilities for an independent path through the model (see figure 6).
Grouping together utterances with similar allophonic realisations in
this manner allows the phonological model to capture co-occurrence
knowledge by isolating co-occurring allophones in the same paths. If
there is significant co-occurrence in the data, this new model should
have lower entropy, and hence greater predictive power, than the
previous model.

Figure 6. Allophone network for the
2 sentence utterance, showing
clusters as separate paths.

The clustering technique used was a combination of hierarchical
clustering and the iterative Lloyd algorithm (Duda and Hart, 1973).
For each specific number of clusters desired, the data were clustered
into that number of groups using an agglomerative hierarchical clus-
tering technique, and then these clusters were used as the seeds to
the Lloyd algorithm. Each step of the hierarchical clustering algo-
rithm involves merging the nearest pair of distinct clusters. Ini-
tially, each utterance forms a singleton cluster, and the procedure
continues until the desired number of clusters is reached. At each
step, the nearest pair of clusters was defined as that pair whose
merging would result in a model with the lowest conditional entropy
H(S|c), which was computed as:

$$H(S \mid c) = \frac{1}{N} \sum_{i=1}^{n} M(i)\, H(S \mid i) \tag{3}$$

where N = total number of utterances in the sample (630),
n = current number of clusters,
M(i) = number of utterances in cluster i, and
H(S|i) = entropy of a string S in cluster i.


Hence, H(S|c) is defined as the weighted average (weighted by clus-
ter size) of the entropies of the individual clusters, which is the same
as the entropy of a string, given that you know which cluster the
string falls into. Though the real objective of this procedure was to
minimise H(S) rather than H(S|c), computing H(S) for the composite
model at each iteration of the algorithm is computationally too
expensive. Though H(S|c) is not guaranteed to be monotonically
related to H(S), it should be in most cases.
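
Equation 3 and the merge criterion can be sketched as follows. This is a minimal sketch under stated assumptions, not the procedure as actually implemented: it assumes each utterance is a tuple of allophone labels (one per transcribed segment) and that the entropy of a string within a cluster is the sum of the per-segment allophone entropies, i.e. segments are treated as independent within a cluster. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def cluster_entropy(utterances):
    """H(S|i): entropy in bits of a string in one cluster (assumed to be
    the sum of per-segment allophone entropies)."""
    total = 0.0
    for column in zip(*utterances):           # one column per segment
        n = len(column)
        counts = Counter(column)
        total -= sum(c / n * math.log2(c / n) for c in counts.values())
    return total

def conditional_entropy(clusters):
    """Equation 3: cluster-size-weighted average of the cluster entropies."""
    n_total = sum(len(c) for c in clusters)
    return sum(len(c) * cluster_entropy(c) for c in clusters) / n_total

def merge_nearest_pair(clusters):
    """One agglomerative step: merge the pair whose merger yields the
    lowest H(S|c)."""
    best_h, best_clusters = None, None
    for i, j in combinations(range(len(clusters)), 2):
        trial = [c for k, c in enumerate(clusters) if k not in (i, j)]
        trial.append(clusters[i] + clusters[j])
        h = conditional_entropy(trial)
        if best_h is None or h < best_h:
            best_h, best_clusters = h, trial
    return best_clusters

def agglomerate(utterances, n_clusters):
    """Start from singleton clusters and merge until n_clusters remain."""
    clusters = [[u] for u in utterances]
    while len(clusters) > n_clusters:
        clusters = merge_nearest_pair(clusters)
    return clusters

# Toy data: two segments whose allophones co-occur perfectly.
data = [("t", "r"), ("t", "r"), ("d", "0"), ("d", "0")]
clusters = agglomerate(data, 2)
print(conditional_entropy([data]), conditional_entropy(clusters))
```

On the toy data the unclustered model has two bits of entropy, while the two recovered clusters have none, which is the effect the clustering study is looking for.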

In the second phase of clustering, the clusters found by hierarchical
clustering were used as a seed to the iterative Lloyd algorithm,
which continued until the improvement for one iteration was less
than a threshold. Each iteration of the Lloyd algorithm involved
the following:

1) For each utterance, compute H(S|c), as in equation 3, with this
utterance as a member of each current cluster; remember the clus-
ter for which H(S|c) is minimal.

2) Once the new cluster is chosen for all utterances, actually make
the switches.
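
The two steps above can be sketched in Python as follows. As before, this is a hedged illustration rather than the original code: the entropy definition (sum of per-segment allophone entropies), the tie-breaking rule (an utterance stays put unless another cluster is strictly better), the threshold value, and all names are assumptions.

```python
import math
from collections import Counter

def cluster_entropy(utterances):
    """Assumed H(S|i): sum of per-segment allophone entropies, in bits."""
    total = 0.0
    for column in zip(*utterances):
        n = len(column)
        total -= sum(c / n * math.log2(c / n) for c in Counter(column).values())
    return total

def conditional_entropy(clusters):
    """Equation 3, skipping clusters that have become empty."""
    clusters = [c for c in clusters if c]
    n_total = sum(len(c) for c in clusters)
    return sum(len(c) * cluster_entropy(c) for c in clusters) / n_total

def lloyd_iteration(clusters):
    """Step 1: find the best cluster for every utterance.
    Step 2: make all the switches at once."""
    moves = []
    for src, cluster in enumerate(clusters):
        for utt in cluster:
            best_dst, best_h = src, None
            # Try the current cluster first so that ties leave the utterance put.
            order = [src] + [d for d in range(len(clusters)) if d != src]
            for dst in order:
                trial = [list(c) for c in clusters]
                trial[src].remove(utt)
                trial[dst].append(utt)
                h = conditional_entropy(trial)
                if best_h is None or h < best_h:
                    best_dst, best_h = dst, h
            moves.append((utt, best_dst))
    new_clusters = [[] for _ in clusters]
    for utt, dst in moves:
        new_clusters[dst].append(utt)
    return [c for c in new_clusters if c]

def lloyd(clusters, threshold=1e-6, max_iter=50):
    """Iterate until the improvement in H(S|c) falls below the threshold."""
    h = conditional_entropy(clusters)
    for _ in range(max_iter):
        candidate = lloyd_iteration(clusters)
        h_new = conditional_entropy(candidate)
        if h - h_new < threshold:
            break
        clusters, h = candidate, h_new
    return clusters

# Seed clusters with one misplaced utterance.
A, B = ("t", "r"), ("d", "0")
result = lloyd([[A, A, B], [B, B]])
print(result, conditional_entropy(result))
```

Because all switches are made simultaneously, a single iteration is not guaranteed to lower H(S|c); the sketch therefore keeps the previous clustering whenever an iteration fails to improve it.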

Typically, the Lloyd algorithm continued for 5-10 iterations, and the
amount of reduction in H(S) over the clusters output from the
hierarchical clustering procedure was another l-i'Ti lower than the
unclustered model.

The results of the clustering study are shown in figure 7. The higher
curve, H(S), is for the composite model given 10, 20, and 30 clusters.
The results show that the entropy of a phonological model can be
lowered 10-15% by modelling the co-occurrence of allophonic forms.
Furthermore, this co-occurrence can be modelled for any sufficiently
large data set by running a standard clustering algorithm, without
the need to explicitly determine what the co-occurrences are. The
significance of the lower curve (H(S|c)) will be discussed below in the
section on speaker groups.


Figure 7. Entropy of utterance model as a
function of the number of clusters.

We also tested whether demographic factors and speech rate could
be used to predict allophonic forms. These results are shown in
figure 8. Chi-squares (at the 97.5% confidence level) were computed
to test for independence between region (each speaker was identified
with one of seven geographic regions or as an 'army brat'), age (by
decade), race, sex, education (HS, BS, MS, or PhD), and speech rate,
vs. form. As can be seen, the results show significant non-
independence between all of the demographic factors vs. form and
rate vs. form. This indicates that all of these factors are significant
predictors of allophonic occurrences. For example, people from New
England tend to say r-less 'your' and people from the South tend
to say 'greasy' with a [z].

Figure 8. Chi-squares for forms vs. demographics


SPEAKER GROUPS

The lower curve in figure 7 shows the conditional entropy of the
model given the cluster, computed as in equation 3. This result
indicates that if the appropriate cluster for the incoming utterance is
known in advance, entropy can be lowered 30-50% from the
unclustered model. The question then arises of how well we can
predict the cluster for an incoming utterance. This question, in
turn, raises a number of additional questions:

1) What explicit predictors of cluster membership are available?
(e.g., sex, region of origin, speech rate, etc.)

2) How consistently does a speaker stay within one cluster? (i.e., if
a speaker stays in the same cluster with reasonable consistency, then
rapid adaptation to a new speaker may be accomplished by choosing
the appropriate cluster after some experience with this speaker, or
choosing an appropriate weighting function over the clusters.)

3) How can we classify a speaker into the appropriate cluster or
choose the appropriate weighting function over clusters for this
speaker at the current time?

4) When during a recognition session should a new cluster be chosen,
or a new weighting function be computed? (e.g., when speech rate
changes, when performance drops, only when a new speaker comes
along, etc.)

The studies described in this section were designed to address the
first two questions.

In order to test for predictors of cluster membership, we performed
chi-squares at the 97.5% confidence level, testing for non-
independence between cluster vs. form, cluster vs. all of our demo-
graphic factors (age, race, region, sex, and education), and cluster
vs. speech rate. There was significant non-independence between
cluster and all allophonic forms except for the /t/ in 'water', as
well as for all demographic factors and rate. The lack of
significance for /t/ in 'water' is not surprising since, out of our
sample of 630 speakers, only five of them aspirated the /t/.
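
For readers unfamiliar with the test, the following sketch computes a Pearson chi-square statistic for a cluster-by-form contingency table. The counts are invented for illustration and are not data from the study. We read "97.5% confidence level" as a significance level of 0.025, for which the critical value with one degree of freedom is approximately 5.024; that reading is our assumption.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a
    contingency table given as a list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof

# Hypothetical counts: rows are two clusters, columns are two allophonic forms.
table = [[40, 10],
         [15, 35]]
stat, dof = chi_square(table)

CRITICAL_975_DF1 = 5.024  # chi-square 0.975 quantile, 1 degree of freedom
significant = stat > CRITICAL_975_DF1
print(f"chi-square = {stat:.2f}, dof = {dof}, non-independent: {significant}")
```

A form that almost never varies, such as the aspirated /t/ with only five occurrences, produces very small expected counts in one column, so the test has little power to detect any dependence.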

In order to test the consistency with which speakers remain in clus-
ters, we gathered a new set of data consisting of speakers repeating
the same sentences many times. Four speakers were recorded in
three sessions each, with recording sessions for the same speaker a
week apart. The recordings were made in a sound-treated room,
using a close-talking microphone and a Nagra tape recorder. Each
recording session consisted of eight readings of the same two sen-
tences used in the experiments described earlier, interspersed in a set
of seven filler sentences. The first five repetitions were uninstructed
(i.e., 'normal reading'). At the sixth repetition, the subjects were
instructed to read very quickly, at the seventh slowly and carefully,
and at the eighth normally. From listening to the recordings, it is
our judgement that the fast readings were, indeed, extremely fast,
and the slow and careful readings were extremely slow and careful.
Since the uninstructed readings were fairly fast, the differences
between the slow and uninstructed readings were more dramatic
than those between the fast and uninstructed readings. The final
data set consists of 96 repetitions of the two sentences, 24 from each
speaker, with 72 repetitions uninstructed or 'normal', 12 fast, and
12 slow and careful.

The same 18 phonemes used in the earlier experiments were phoneti-
cally transcribed, with the aid of the tools described earlier, by
Michael Cohen, and checked by Jared Bernstein and Gay Baldwin.
Each of the 96 utterances was then classified into the clusters based
on the 630-speaker data as described in the previous section. We
chose to classify them into the 10-cluster version so that each cluster
would be based on a large number of utterances (approximately 63).
Each utterance was classified into the cluster with the centroid with
minimal Euclidean distance to the utterance. Table 1 shows the
number of utterances for each speaker classified into each cluster.
As can be seen, most of the utterances for each speaker tend to be
classified into two or three clusters. Eleven of the 12 slow utter-
ances were classified into cluster two. The fast utterances did not
tend to fall into any one cluster.
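
Nearest-centroid classification of this kind can be sketched as below. The text does not say how utterances were represented as vectors; this sketch assumes a one-hot encoding of the allophone chosen at each transcribed segment, with the centroid being the mean vector of a cluster's members. The inventory, cluster contents, and function names are hypothetical. `math.dist` requires Python 3.8 or later.

```python
import math

def encode(utterance, inventory):
    """One-hot encode an utterance; inventory[k] lists the allophonic
    forms allowed at segment k (an assumed representation)."""
    vector = []
    for form, forms in zip(utterance, inventory):
        vector.extend(1.0 if form == f else 0.0 for f in forms)
    return vector

def centroid(vectors):
    """Component-wise mean of a cluster's member vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def nearest_cluster(vector, centroids):
    """Index of the centroid with minimal Euclidean distance."""
    return min(range(len(centroids)),
               key=lambda k: math.dist(vector, centroids[k]))

# Toy inventory: segment 0 is "t" or "d"; segment 1 is "r" or deleted ("0").
inventory = [("t", "d"), ("r", "0")]
clusters = [
    [("t", "r"), ("t", "r"), ("t", "0")],
    [("d", "0"), ("d", "0")],
]
centroids = [centroid([encode(u, inventory) for u in c]) for c in clusters]

label_a = nearest_cluster(encode(("t", "0"), inventory), centroids)
label_b = nearest_cluster(encode(("d", "0"), inventory), centroids)
print(label_a, label_b)
```

The same distances could also be turned into a weighting function over clusters rather than a single hard choice, which is the alternative the text goes on to recommend.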


Table 1. Classification of speaker utterances
according to pre-existing clusters

These results indicate that although speakers fall into the same clus-
ters with some consistency, choosing a single cluster for a speaker is
inadequate. A more reasonable approach may be to choose a
weighting function over all of the clusters. Furthermore, cluster
membership seems to be somewhat dependent on speech rate.

INTRA-SPEAKER VS. INTRA-GROUP VARIATION

The results of the previous section suggest a method of adaptation
by choosing appropriate sets of (or weights for) clusters for a
speaker. The study described in this section addresses the question
of whether or not it is useful to try to further adapt to the indivi-
dual speaker once the clusters are chosen. We have addressed that
question by comparing the amount of variation within a single
speaker to the amount of variation within a single cluster. If there
is considerably less variation within a speaker than within even a
single cluster, then there may be ways to further adapt to the indi-
vidual speaker.

The data used for this experiment included both the 630-speaker
data described earlier, and the four-speaker multi-repetition data
described in the previous section. We compared the entropy of a
model trained for a single speaker in the multi-repetition data set to
the entropy of a cluster from the 630-speaker set. Only the 18 unin-
structed utterances for each speaker were used from the multi-
repetition data because the 630-speaker data were recorded without
instruction. The comparison was made with the 10-cluster version
of the 630-speaker data so that each cluster would be based on an
adequate amount of data. In order to be able to make a fair com-
parison, it was necessary to compare the entropy of models trained
on the same number of speakers, so we sampled the large clusters
from the 630-speaker set by randomly choosing a cluster, and then
randomly choosing the appropriate number of speakers from the
cluster. This was done 1000 times, and the mean entropy of the
18-member clusters was computed. The mean entropy of the 18-
member clusters from the 630-speaker data was 8.30, and for a sin-
gle speaker from the multi-repetition data was 6.86, approximately
18% lower. This suggests the possibility of significant individual
speaker adaptation beyond the choosing of appropriate clusters.
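
The size-matched sampling procedure can be sketched as follows. This is an illustration under assumptions: the entropy of a sample is taken to be the sum of per-segment allophone entropies, sampling is without replacement, and the function names, seed, and toy clusters are ours.

```python
import math
import random
from collections import Counter

def cluster_entropy(utterances):
    """Assumed entropy of a string in a sample: sum of per-segment
    allophone entropies, in bits."""
    total = 0.0
    for column in zip(*utterances):
        n = len(column)
        total -= sum(c / n * math.log2(c / n) for c in Counter(column).values())
    return total

def mean_subsample_entropy(clusters, size=18, trials=1000, seed=0):
    """Repeatedly pick a random cluster, draw `size` members from it
    without replacement, and average the entropies of the draws."""
    rng = random.Random(seed)
    eligible = [c for c in clusters if len(c) >= size]
    total = 0.0
    for _ in range(trials):
        cluster = rng.choice(eligible)
        total += cluster_entropy(rng.sample(cluster, size))
    return total / trials

# Toy clusters with a single transcribed segment each.
uniform = [[("t",)] * 30, [("d",)] * 40]
mixed = [[("t",)] * 20 + [("d",)] * 20]
mean_uniform = mean_subsample_entropy(uniform)
mean_mixed = mean_subsample_entropy(mixed)
print(mean_uniform, mean_mixed)
```

Matching the sample size matters because an entropy estimated from 18 utterances is biased low relative to one estimated from a full cluster of roughly 63, so comparing a single speaker against whole clusters would overstate the difference.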


DISCUSSION 

The studies described in the previous four sections have demon-
strated some types of structure in the phonological variation
observed in a data set consisting of two sentences (21 words) read by
many speakers. In addition, we have shown some types of lexical
representations that might be used to capture this structure.
Representations were compared by measuring their entropy or
predictive ability. It is assumed that lower entropy can lead to
improved recognition performance. In the near future we intend to
test this assumption in a series of recognition performance studies.

The results described above have a number of implications for sys-
tem design. The first study suggested that a significant advantage
in recognition accuracy can be gained by incorporating pronuncia-
tion probabilities in a lexical model. The major problem in incor-
porating such knowledge into large vocabulary systems is finding
sufficient amounts of training data to adequately estimate allophone
probabilities for the segments of each word in the vocabulary. A
possible solution to this problem is to use knowledge of phonological
rules, rule groups, and the co-occurrence of allophonic forms to
reduce the number of independent probabilities being estimated.

The second study showed co-occurrence relationships between allo-
phonic forms. In addition, an automatic clustering technique was
demonstrated that could be used to model this co-occurrence for a
data set without explicit knowledge of what these co-occurrences
are. This result suggests that lexical representations can be
improved by including a small number of sets of word models, each
trained on an appropriate cluster of a large data set. When scoring
sequences of word pronunciation hypotheses for an utterance, each
sequence would only include one set of word model probabilities.

The last two studies suggest methods of adaptation to a new
speaker, as well as ongoing adaptation within a session with a single
speaker. In figure 7, H(S|c) is shown to be considerably lower than
H(S). This suggests that predicting the appropriate cluster for an
utterance can reduce entropy considerably by allowing the search to
be confined to the model of a single cluster.

The third study, which explored the consistency with which a
speaker remains in a cluster, suggests that predicting the cluster for
an utterance cannot be achieved solely by speaker adaptation, since
a speaker will not stay in a single cluster consistently. However, the
third study does suggest that H(S|c) can be approached by choosing
an appropriate weighting function over all the clusters, given some
experience with a speaker. Furthermore, these results suggest that
knowledge of speech rate can be used to improve prediction of the
appropriate cluster for an utterance. Ongoing adaptation might be
achieved by periodically recomputing the weighting function. We
have not explored the question of when, or how often, this
weighting function should be recomputed.

The results of the fourth study, comparing intra-speaker to intra-
cluster entropy, show greater consistency within a single speaker
than within the clusters found in the previous studies. This suggests
that speaker adaptation can be improved beyond the choice of clus-
ters by further refinement of model parameters, based on extended
experience with a speaker. The major problem with individual
speaker adaptation is that model parameters have to be estimated
from a small amount of data for the speaker. The advantage of
adaptation by cluster choice is that the cluster could be well trained
on large amounts of data. The problem of insufficient data for indi-
vidual speaker adaptation can possibly be handled by exploiting
knowledge about phonological rules, rule groups, the co-occurrence
of allophonic forms, and implicational rule hierarchies, in order to
decrease the number of parameters being estimated as well as
increase the number of samples for each parameter. We intend to
explore methods for doing this in future work.


ABSTRACT

This paper describes alternative approaches to lexical access in the
CMU ANGEL speech recognition system. One approach explores
scoring alternatives within the framework of the CMU module. In
another approach, the asynchronous phonetic hypotheses generated
by the acoustic-phonetics module are converted to a directed graph.
This graph is compared to a pronunciation dictionary. Performance
results for the approaches and the original CMU approach were
similar. An error analysis indicates promising directions for further
work.


OVERVIEW 

A lexical access subsystem can be divided into two major com-
ponents. One component is a lexicon: a data structure that contains
a list of words and a representation of the allowable pronunciations
of those words. Those pronunciations may have associated probabili-
ties and the probabilities may be dynamic in nature. That is, they
may change due to new estimations of speaker-type, speech style,
and so on.

The other component is the search and scoring mechanism. This
compares the output of an acoustic-phonetics module with the lexi-
con and determines the word sequence that with highest probability
corresponds to those outputs. In doing so, the search and scoring
module must take into account the characteristics of the AP output,
such as insertion, deletion, and substitution probabilities.

This paper evaluates alternative search and scoring mechanisms in a
lexical access module.

GOALS 

This is a progress report on work at SRI International in cooperation
with Carnegie-Mellon University (CMU) and sponsored by DARPA.
SRI is exploring alternative approaches to lexical access in the
framework of a speech recognition system (ANGEL) being developed
at CMU [1]. The ANGEL system is designed to recognise a large
vocabulary from American English continuous speech. Our goal is
to devise an approach to lexical access that best takes advantage of
all information available from other knowledge sources in the speech
recognition system (particularly the acoustic-phonetic knowledge
source), and also is resilient in the face of errors made by those
other knowledge sources.


¹ This research was sponsored by Defense Advanced Research Projects Agency
Contract N00039-85-C-0302. The views and conclusions contained in this document
are those of the authors and should not be interpreted as representing the official pol-
icies, either expressed or implied, of the Defense Advanced Research Projects Agency
or the US Government.


OVERVIEW OF THE
CMU LEXICAL-ACCESS MODULE
(circa summer 1986)

A lexico-centric block diagram of the ANGEL speech recognition
system is shown in Figure 1. The acoustic-phonetic module [2] out-
puts a set of phonetic hypotheses (see Figure 2) to the lexical-access
module. These lattices contain "firings" that give the estimated
probabilities of segments occurring in particular time intervals.
However, the relative probability of one firing versus another is not
estimated, even if the two firings overlap in time. This is because
the acoustic-phonetic module is made up of a set of independent seg-
ment locators and classifiers.


Figure 1. A lexico-centric block diagram of the ANGEL
speech recognition system


The 1986 version of CMU's lexical-access module converted this lat-
tice structure into an integrated lattice. The lattice integrator
created boundaries where acoustic-phonetic segments began and
ended and, in particular, created new boundaries where segments
overlapped. It collapsed information from the acoustic-phonetic lat-
tices by combining the probability estimates from overlapping
acoustic-phonetic segments to derive likelihoods of the newly created
segments. An example of this integrated-lattice data is shown in
Figure 2.

Figure 2. Acoustic-phonetic lexical-access interface
data structures


The integrated lattice was matched to a dictionary network that
represented all allowed pronunciations of words. The match algo-
rithm allowed one or more integrated-lattice segments to match a
particular dictionary segment. A scoring algorithm computed the
costs of matching paths in the dictionary and the integrated lattice.
The scoring was based on the integrated-lattice likelihoods, com-
parisons of expected durations of dictionary segments with durations
of corresponding concatenated integrated-lattice segments, and com-
parisons of the integrated-lattice segments with an independent
coarse labeling of the input speech (based on the ZAPDASH coarse
labeler [3]).


The lexical-access module also computed anchor regions (see Figure
2) from the waveform. Anchor regions define possible syllable and
word boundaries as the consonantal regions between vocalic regions
specified by the ZAPDASH coarse labeling routine.


The match routine searched for words between all (reasonable) pairs
of anchor regions. The match routine did so by looking for patterns
in the integrated lattice that matched patterns in its lexicon. The
match routine passed matches and their likelihood to the syntactic
module. The syntactic module hypothesised sentences, verifying the
word junctures with the lexical-access module.

MODIFICATIONS  TO  THE  CMU  SYSTEM 


SRI decided to explore a number of alternative scoring and search
algorithms to determine how to make the best use of the informa-
tion contained in the integrated lattice. We limited our search of
alternative lexical scoring/search algorithms to variations of the fol-
lowing CMU lexical-access characteristics:


1. Lexical phonemes were required to begin and end at the boun-
daries of the integrated lattice.

2. Phoneme scores were weighted by the duration of the phoneme.

3. Phoneme scores included penalties if the duration was below or
above a preset minimum/maximum duration.

4. Phoneme scores included penalties when their phoneme type did
not agree with the coarse-labeling information.

Our goal was to see how well a lexical scoring algorithm could
hypothesise words if you knew where the words begin and end. The
search of alternative lexical scoring algorithms tried to maximise the
rank of the correct word given the known begin and end times of
that word. Our best algorithm of this type is described below:

1. Phoneme scores were not weighted by their duration. Therefore
the score of a word candidate was the average of the individual
phoneme scores. (In the CMU algorithm, the word score is the dura-
tion-weighted average of the individual phoneme scores.)

2. Lexical phonemes were allowed to begin or end every 10 msec. A
minor degradation in performance was observed when phonemes
were required to begin and end at the boundaries of the integrated
lattice.

3. The preset minimum/maximum duration constraints were used as
hard limits on the allowable duration of a phoneme.

4. The coarse labeling information was not used to modify the
phoneme scores.

Evaluation and Testing Data

The matcher and lattice integrator portions of the CMU lexical-
access module were compared with the alternative routine described
above. These modules were tested on 100 "Electronic Mail" sen-
tences, part of a larger database collected at CMU. In this data
each of 10 speakers said ten sentences composed from a 339 word
vocabulary. An example sentence is "Send a message to Smith at
CMU." The data was hand-labeled at CMU and the outputs of the
CMU acoustic-phonetic module and CMU's anchor generator module
were sent to SRI. All CMU modules for this study are circa mid-
1986. All testing was done on this continuous-speech database in a
speaker-independent manner with no grammatical constraints.

The results shown in the tables below are the percent correct words
in the test set for the lexical access algorithms given correct anchor
regions and hand-set endpoints.


Lexical Access Performance with Hand-Set Endpoints

Rank       Alternate    CMU System Simulation
Correct    .55          .45
Top 3      .79          .71
Top 10     M            .88


Lexical Access Performance, Using Anchor Regions

Rank       Alternate    CMU Reported Results
Correct    .30          .32
Top 3      .55          .55
Top 10     .81          .76

As our results show, the performance advantages gained on known
endpoints did not translate into an advantage when endpoint inter-
vals were used. However, we believe that this scoring algorithm
might result in an improved word accuracy if sentence hypotheses
are constructed from the individual word hypotheses. Our current
research has therefore been aimed at generating sentence
hypotheses.


CONNECTED LATTICES

SRI also explored a modification to the mid-1986 CMU system
which eliminated the integrated-lattice component and hypothesised
words directly from a structure more similar to the acoustic-
phonetic output.


We felt that a significant amount of information was being lost by
the integrated-lattice component. For instance, segmentation deci-
sions made by the acoustic-phonetic module were in effect over-ruled
by the matcher using the integrated lattice. Similarly, decisions
about relative likelihoods of overlapping acoustic-phonetic firings
were made by the integrated lattice, when such decisions should
have been made by the acoustic-phonetics module.

However, the acoustic-phonetic output without the lattice integrator
was not amenable to direct lexical access. Often the string of
correct phonetic hypotheses was in the lattice, but either one
correct phonetic segment overlapped with the next correct one or
was separated from it by a small interval of time. This problem was
solved when using the lattice integrator by splitting segments at all
overlap points and by allowing single dictionary segments to match
a series of integrated lattice segments.

A new representation of the acoustic-phonetic data was chosen (the
connected lattice) that retains the information in the acoustic-
phonetic lattice, such as duration of segments, and reduces the possi-
ble search space by not introducing additional boundaries at seg-
ment overlaps.

Conversion  to  the  Connected  Lattice 

An algorithm that transforms the acoustic-phonetic lattice to the
connected lattice works as follows:

The acoustic-phonetic lattices are converted to a simple directed
graph, the connected lattice. There are two types of arcs in this
graph. AP ARCS are created by replacing each acoustic-phonetic
firing with an arc going from a node representing the start time of
the firing to a node representing the end time of the firing. CON-
NECT ARCS are created between all nodes that have incoming AP
arcs and all nodes within 100 ms of these nodes that have outgoing
AP arcs. Connect arcs are necessary because without them there
typically would not be a connected path between the start and end
of a sentence.
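
The conversion just described can be sketched in Python as follows. This is a minimal sketch under our own assumptions, not the SRI implementation: nodes are identified with times in milliseconds, "within 100 ms" is read as an absolute time difference (so a connect arc may point backwards when two firings overlap), and the `Firing` and `Arc` types and the toy firings are hypothetical.

```python
from collections import namedtuple

# A firing covers [start, end] in ms and carries segment probabilities.
Firing = namedtuple("Firing", "start end probs")
# kind is "AP" or "CONNECT"; firing is None for connect arcs.
Arc = namedtuple("Arc", "kind src dst firing")

def build_connected_lattice(firings, max_gap_ms=100):
    """Return (ap_arcs, connect_arcs) for a list of firings."""
    ap_arcs = [Arc("AP", f.start, f.end, f) for f in firings]
    ends = sorted({f.end for f in firings})      # nodes with incoming AP arcs
    starts = sorted({f.start for f in firings})  # nodes with outgoing AP arcs
    connect_arcs = [
        Arc("CONNECT", end, start, None)
        for end in ends
        for start in starts
        if end != start and abs(start - end) <= max_gap_ms
    ]
    return ap_arcs, connect_arcs

firings = [
    Firing(0, 80, {"s": 0.9}),
    Firing(90, 150, {"i": 0.8, "e": 0.2}),
    Firing(300, 350, {"t": 0.7}),
]
ap_arcs, connect_arcs = build_connected_lattice(firings)
connect_pairs = {(a.src, a.dst) for a in connect_arcs}
print(sorted(connect_pairs))
```

In the toy data the 10 ms gap between the first two firings is bridged by a connect arc, while the 150 ms gap before the third firing is not, so no path reaches it from the earlier firings.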

Output probabilities are assigned to the AP arcs. These probabili-
ties are the product of a vector and a matrix. The vector consists of
the probabilities of phonetic segments as assigned by the acoustic-
phonetic module in the time interval corresponding to the AP arc
(or acoustic-phonetic firing). The matrix is a segment-confusion
matrix corresponding to the observed performance of the acoustic-
phonetic module. Probabilities are also assigned to the connect arcs
in a context-dependent manner that makes more reasonable con-
nects more likely. This is described below.
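
As a small worked example of the vector-matrix product, the sketch below multiplies a firing's segment probabilities by a confusion matrix. The orientation of the matrix (rows indexed by the reported segment, entries giving the probability of each true segment) is our assumption, and all numbers are invented.

```python
def output_probabilities(firing_probs, confusion):
    """Vector-matrix product.

    firing_probs[r] is the probability the acoustic-phonetic module
    assigned to segment r; confusion[r][t] is the assumed probability
    that the true segment is t when r is reported.
    """
    out = {}
    for reported, p in firing_probs.items():
        for true_segment, q in confusion[reported].items():
            out[true_segment] = out.get(true_segment, 0.0) + p * q
    return out

firing = {"t": 0.7, "d": 0.3}
confusion = {
    "t": {"t": 0.8, "d": 0.2},
    "d": {"t": 0.1, "d": 0.9},
}
probs = output_probabilities(firing, confusion)
print(probs)
```

Here the output probability of "t" is 0.7 × 0.8 + 0.3 × 0.1 = 0.59, so the confusion matrix pulls the module's raw estimate toward what its observed error pattern supports.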

Search  of  the  Connected  Lattice 

A  search  is  performed  to  compare  the  system's  lexicon  (stond  in  a 
pronunciation  graph)  with  the  connected  lattice.  Tuples  consisting 
of  the  initial  lexicon  node  and  all  initial  nodes  in  the  connected  lat¬ 
tice  (corresponding  to  all  permissible  word  starting  points  given  the 
anchor  regions)  are  placed  on  a  list  of  active  paths.  The  items  in 
the  list  are  called  partial  paths  The  search  algorithm  proceeds  by 
taking  a  partial  path  off  of  the  list,  extending  the  path  in  all  possible 
wavs2  (the  product  of  every  lexicon  arc  leaving  the  lexicon  node  at 

the  end  of  the  partial  path,  and  every  AP  or  connect  arc  leaving  the 
connected  lattice  '  •  at  the  end  of  the  partial  path).  These  new 

paths  are  olac-d  bacK  in  the  list.  Paths  that  are  complete  (that  end 
in  the  end  anchor  region  |  are  Uso  placed  in  a  list  of  complete  paths. 

Partial  paths  ( sets  of  associations  of  dictionary  segments  and  con¬ 
nected  lattice  arcs)  are  scored  as  the  sum  of  the  log- probabilities  of 

^The  swell  kigorithm  ion  lot  lilow  ptlht  to  'ocp,  nor  as  paths  have  two 
consecutive  connect  secs  or  begin  with  sr  end  with  connect  tecs. 


the  components  of  the  path.  A  component  probability  for  an  con¬ 
nected  lattice  ia  the  probability  of  the  dictionary  segment  iu  the  set 
of  output  probabilities  of  the  associated  AP  arc  The  component 
probability  for  connect  area,  which  hive  no  asaocialed  dictionary 
segment,  is  a  function  of  the  length  of  the  connect  arc  relative  to 
the  lengths  of  the  AP  arcs  that  surround  '.hem.  For  instance  two 
long  AP  Arcs  connected  by  a  short  connect  arc  would  have  a  much 
higher  probability  than  two  short  AP  arcs  connected  ov  a  long  con¬ 
nect  arc. 

It is the function of the connect arcs to permit AP arcs to connect
reasonably without affecting the score of the paths; however,
unreasonable sequences of AP arcs are inhibited by the scoring of the
connect arcs. The connect arcs also, in effect, lessen the impact of
premature segmentation decisions made by the acoustic-phonetic
module.
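The search and scoring described above can be sketched roughly as
follows. This is a minimal illustration, not the actual SRI
implementation: the dictionary-based graph layout, the function names,
and in particular the form of connect_log_prob are assumptions (the
text only says the connect-arc score depends on its length relative to
the neighbouring AP arcs). Each extension consumes one lexicon arc and
one AP arc, optionally preceded by a single connect arc, which enforces
the constraints of footnote 2.

```python
import math

def connect_log_prob(gap, prev_len, next_len):
    # Assumed form: the score falls as the connect arc gets longer
    # relative to the two AP arcs it joins.
    return math.log((prev_len + next_len) / (abs(gap) + prev_len + next_len))

def search(lexicon, word_ends, ap_arcs, connect_arcs,
           start_nodes, end_nodes, lex_root=0):
    # A partial path: (lexicon node, lattice node, log score, last AP arc length).
    active = [(lex_root, n, 0.0, None) for n in sorted(start_nodes)]
    complete = []
    while active:
        lex_node, lat_node, score, prev_len = active.pop()
        # The next AP arc may start here, or across one connect arc.
        # A path may not begin with a connect arc (prev_len is None).
        entries = [(lat_node, None)]
        if prev_len is not None:
            entries += [(dst, dst - lat_node)
                        for dst in connect_arcs.get(lat_node, [])]
        for entry, gap in entries:
            for ap_end, probs in ap_arcs.get(entry, []):
                ap_len = ap_end - entry
                for phone, next_lex in lexicon.get(lex_node, []):
                    p = probs.get(phone, 0.0)
                    if p <= 0.0:
                        continue
                    new_score = score + math.log(p)
                    if gap is not None:
                        new_score += connect_log_prob(gap, prev_len, ap_len)
                    active.append((next_lex, ap_end, new_score, ap_len))
                    # Complete: word finished inside the end anchor region.
                    if ap_end in end_nodes and next_lex in word_ends:
                        complete.append((word_ends[next_lex], new_score))
    return sorted(complete, key=lambda c: c[1], reverse=True)

# Toy lexicon: "it" = ih t, "eat" = iy t (hypothetical data).
lexicon = {0: [("ih", 1), ("iy", 2)], 1: [("t", 3)], 2: [("t", 4)]}
word_ends = {3: "it", 4: "eat"}
# Lattice nodes are times; AP arcs carry phone output probabilities.
ap_arcs = {0: [(10, {"ih": 0.6, "iy": 0.4})],
           12: [(20, {"t": 0.9, "d": 0.1})]}
connect_arcs = {10: [12]}   # a short gap between the two AP arcs
results = search(lexicon, word_ends, ap_arcs, connect_arcs, {0}, {20})
```

In this toy lattice both words are reached through the same connect
arc, so the ranking is decided by the AP output probabilities alone.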

Evaluation 

The above algorithm was evaluated using CMU's 100 electronic mail
sentences described above. The results are summarized in the tables
below.


Lexical Access Performance Using Anchor Regions
(reported results)

Rank       CMU             Connected Lattice
           (3?4 words)     (?40 words)
correct    .32             .35
Top 3      .55             .52
Top 10     .78             .74

Lexical Access Performance with Hand-Set Endpoints
(CMU system was simulated at SRI)

Rank       CMU             Connected Lattice
           (?40 words)     (?40 words)
correct    .45             .03
Top 3      .71             .79
Top 5      .80             .84
Top 10     .88             .92
Top 20     .90             [illegible]

DISCUSSION 

The recognition results above show similar performance for the two
modules. A closer examination of the data revealed that the connected
lattice module tended to have catastrophic errors. Such errors
typically occurred when one of the proper segments was deleted by the
acoustic-phonetics module. This version of the connected lattice
search algorithm was not equipped to deal with such problems. For
instance, of the 17 words not in the top 20 choices for the connected
lattice system with hand-set endpoints, 16 were caused by the AP
module's failure to spot one or more segments in a word. One error
was due to a speaker's mispronunciation of that word. Of the words
that were ranked 6 through 19, the overwhelming majority had the
correct segments present but with low probabilities.

In order to soften the effect of AP deletions yet continue to take
advantage of the acoustic-phonetic data, the new connected lattices
being designed should include insertion, deletion, and substitution
probabilities (separate from phonological insertion, deletion, and
substitution) for segments, based on a model trained with
acoustic-phonetic module output. Probabilities for connect arcs will
be estimated from similar data; however, in later systems it is hoped
that the acoustic-phonetic module will also provide some information
about the reasonableness of inter-segment junctures.

A proposal that describes a new interface between the
acoustic-phonetic module and the lexical-access module is included in
the appendix to this paper.
APPENDIX

Proposed New Interface between
Acoustic-Phonetics and Lexical Access
(abbreviated form of a memo
circulated Fall 1188)


OVERVIEW 

The goal of this proposal is to define an interface between the
Acoustic-Phonetics (AP) module and the Lexical-Access (LA) module
that will improve overall system performance.

The main thesis of this proposal is that some hard decisions
(typically segmentation decisions) that are made in the AP module are
better left to the LA or even syntax modules, using probabilities
assigned by the AP module. For instance, an acoustic-phonetic network
that provides probabilities for different segmentations might be used
to avoid many of the problems of deleted or split segments, after
dictionary and syntactic constraints are applied.

The second thesis of this proposal is that the performance of the AP
module should be evaluated in the context of the LA module and vice
versa. This implies that, to improve system performance, a tight
feedback loop should be established for developing the two modules.
For example, a new network-based AP module should be frequently
presented (perhaps even in half-baked form) to the LA group, who
should then evaluate word hypotheses, discover the specific areas in
the AP data as well as in the LA algorithms that need the most
improvement, and feed this back to the AP people for further
refinements.

GENERAL  DISCUSSION 

We have come to the conclusion that performance of the top word
hypothesis made by the lexical access module is best without a
lattice integrator. However, without such an integrator, AP errors
are more serious, causing a higher percentage of words not to be
included in the top 20 words hypothesized by the LA module. The
proposed interface between the two modules is expected to help solve
this problem.

The following problems are listed in order of seriousness of error
(those causing the most problems to those causing the least), based
on an analysis of errors made by our current LA algorithms:


1) PHONE DELETION ERRORS

The AP module often combines adjacent phones into a single phone
with no competing two-phone hypothesis. This can cause a fatal error
for word hypothesis routines that do not account for AP segment
deletion, since the lexicon only accounts for phonological deletions
and for some generalizations about AP deletion. In fact, phone
deletion and insertion are the major causes of words missing from the
(non-lattice-integrated) word lattice in our experience.

The lattice integrator deals with phone deletion errors by
oversegmenting, that is, by creating extra boundaries on all
overlaps. By creating extra boundaries, the word network can be
traversed in cases where it could not be without a lattice
integrator. This allows the correct word to be included in the top 20
words hypothesized more often, but often permits incorrect words to
achieve better scores than the correct word.

Ideally, the AP module would hypothesize many segmentations, though
some might have low probabilities. It is inevitable, however, that
some deletions will occur. Therefore, statistics for estimating the
probability of deleted phones are necessary for the LA module.
First-order statistics such as "the probability that an /ih/ is
deleted anywhere" can be computed on a large data base such as the
one that will be used to estimate phone confusion probabilities.
Second-order statistics, such as the probability of deleting an /ih/
after an /iy/ (as in the word "ceeceeing"), may be more desirable. In
this example, although the general probability of deleting a vowel
may be low, deleting a vowel in the context of another vowel may be
much more likely. Higher-order statistics, such as the probability of
deleting segments in particular words, may be even more helpful for
certain high frequency words.
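As an illustration of the first- and second-order statistics mentioned
above, the sketch below estimates deletion probabilities by relative
frequency. It assumes (hypothetically) that training data is available
as aligned events of the form (previous phone, phone, was it deleted);
the function name and the toy events are not from the memo.

```python
from collections import Counter

def deletion_stats(events):
    """events: iterable of (previous_phone, phone, was_deleted)."""
    seen1, del1 = Counter(), Counter()   # per phone
    seen2, del2 = Counter(), Counter()   # per (previous phone, phone)
    for prev, phone, deleted in events:
        seen1[phone] += 1
        seen2[(prev, phone)] += 1
        if deleted:
            del1[phone] += 1
            del2[(prev, phone)] += 1
    # Relative-frequency estimates; Counter returns 0 for unseen keys.
    first = {ph: del1[ph] / n for ph, n in seen1.items()}
    second = {ctx: del2[ctx] / n for ctx, n in seen2.items()}
    return first, second

# Toy alignment data: /ih/ is deleted mostly after /iy/.
events = [("iy", "ih", True), ("iy", "ih", True), ("iy", "ih", False),
          ("s", "ih", False), ("t", "ih", False), ("k", "ih", False)]
first_order, second_order = deletion_stats(events)
```

With these toy counts the context-free estimate for /ih/ is 2/6, while
the estimate after /iy/ is 2/3, which mirrors the vowel-after-vowel
argument in the text.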

2)  PHONE  INSERTION  ERRORS 

The AP module often splits phones or inserts spurious phones, which
affects the LA module in ways similar to phone deletion. Error
statistics can be computed for phone insertion similar to those for
phone deletion. Furthermore, an AP post-processor might examine the
AP data for consecutive similar segments and create an additional AP
segment if it decides that there is positive probability that these
two segments represent one underlying segment.

3)  ANCHOR  ERRORS 

The current lexical access algorithm uses anchor regions based on
ZAPDASH analysis to propose regions where words may start. Although
the claim has been made that 98% of all words are found by this
analysis, the boundaries proposed by ZAPDASH are not necessarily even
close to the appropriate phone boundaries produced by the AP module.
Clearly, anchor generation must be synchronized with the phone
alignments. Our lexical-access analyses for systems without lattice
integration do not use anchors but rather hand-marked times
corresponding to AP lattice output that may then be "Nullified."

4) ERRORS OF PHONE ALIGNMENT

The proper phones may be located and classified but they may be
erroneously aligned. This is shown in the figure below for the word
"TV" (t iy v iy).

[Figure: AP segment alignment for t iy v iy, with regions labeled
"overlap", "overlap and onset alignment problem", and "gap"]

Figure 3. Alignment problems in the AP module


Adjusting the alignment of segments is important when recognizing
from current AP lattices. Better alignment can be accomplished by
requiring locators for different segments to use common information
(such as a coarse labeling of the input speech) when making alignment
decisions.

One current SRI lexical access algorithm uses heuristics to solve this
problem. This is not the most desirable solution, but in the absence
of acoustic-phonetic information to determine the goodness of
alignment choices it is an expedient choice. Overlaps or gaps were
penalized more heavily as the ratio of the length of the overlap or
gap to the sum of the lengths of the segments connected by the
overlap or gap increased.
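The ratio used by this heuristic can be written down directly. The
sketch below is only an illustration: segments are assumed to be
(start, end) times in the same units, and the linear penalty with a
hypothetical weight parameter stands in for whatever penalty function
the SRI algorithm actually applied, which the text does not give.

```python
def juncture_ratio(seg_a, seg_b):
    """Overlap or gap length divided by the summed segment lengths."""
    (a_start, a_end), (b_start, b_end) = seg_a, seg_b
    # Positive difference is a gap, negative is an overlap; both count.
    mismatch = abs(b_start - a_end)
    return mismatch / ((a_end - a_start) + (b_end - b_start))

def juncture_penalty(seg_a, seg_b, weight=1.0):
    # Assumed linear penalty; grows with the ratio, zero for a clean abutment.
    return weight * juncture_ratio(seg_a, seg_b)

clean = juncture_penalty((0, 10), (10, 20))    # segments abut
small_gap = juncture_penalty((0, 10), (12, 20))
big_overlap = juncture_penalty((0, 4), (1, 5))
```

A small gap between two long segments is penalized lightly, while a
large overlap between two short segments is penalized heavily.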

In contrast, the lattice integrator solved the phone-alignment
problem by creating new AP boundaries at overlaps and by ignoring
gaps if there were no labeled segments in the gap.

Any alignment scheme should aim towards allowing phones to follow
each other reasonably, while disallowing incorrect phones with too
much overlap or gap. Since absolute decisions cannot be made with
100% certainty, alternative segmentations should be provided. When
alternative segmentations are provided, probabilities of these
segmentations should be estimated as well. It is more appropriate to
have the AP module apply acoustic-phonetic information to evaluate
the junctures between segmental hypotheses than to have the LA module
make these decisions.

5) ERRORS OF PHONE SUBSTITUTION

With phone substitution errors, the correct region for a phone is
found, but the correct phone is labeled with little or no probability.
This can be overcome by characterizing the phone-substitution
probabilities of the AP module over a large data base, and using this
"phone confusion matrix" in lexical access. This is currently being
done by the AP module. Of course, the estimation of
P(phone | label), the confusion matrix, should take into account
the a priori probability of the label.

PROPOSAL 

One possible interface is outlined below. This interface starts with
an AP module similar to the current CMU module, but with better
phone alignment (perhaps based on the ZAPDASH segmentation scheme).
This system, however, also explicitly rates the probabilities of
lattice phones following each other, which has the effect of doing
"phone-juncture verification" for all nearby phones. Also, the
probabilities for merging similar labels are explicitly computed by
the AP module. This system is then statistically characterized in
terms of phone substitution probabilities, phone deletion
probabilities, and phone merger probabilities. All this information
is represented in a directed-graph data structure.


INTERFACE SPECIFICATION

A directed graph data structure is output by the AP module with
the following characteristics:

1. Each node in the graph represents a particular point in time.
There may be several nodes corresponding to the same point in time.
This may happen, for instance, if a particular AP event is dependent
on another event. For instance, a vowel may be dependent on a
following nasal. Then the hypothesis is only connected to things
consistent with its hypothesis.

2. There are arcs leaving each node in the graph. These arcs
correspond to one of two possible things:

(a) The firing of a locator for a given type of phone (an AP arc).
AP arcs always lead to other nodes.

(b) The possibility that a locator did not fire for a given phone (an
insertion arc). An insertion arc always leads to the node it came
from.

3. There are two probabilities associated with each AP arc (above):

(a) the probability that the locator firing corresponding to the arc
was valid, P(AP-arc | node), the transition probability for the AP
arc;

(b) the set of output probabilities of phones given that the AP arc
is valid, P(phone | AP-arc).

4. Similarly, there are two probabilities associated with each
insertion arc: the probability that any phone can be inserted at the
node, which is P(insertion-arc | node); and the probability of a
particular phone being inserted at this point given that there was an
insertion, which is P(phone | insertion-arc), the output
probabilities of the insertion arcs.

5. For a given node, the sum of the transition probabilities for the
arcs leaving that node, in other words all the AP arcs plus the
insertion arc, should equal 1.0. Similarly, for a given arc, the sum
of its output probabilities should be 1.0.
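One way to picture the structure specified in points 1 to 5 is the
sketch below. The class names, field names, and the check_graph helper
are hypothetical and chosen only for illustration; the memo does not
prescribe a concrete representation. The checker tests the two
normalization constraints of point 5 and the arc destinations of
point 2.

```python
from dataclasses import dataclass, field

@dataclass
class Arc:
    kind: str          # "AP" or "insertion"
    dest: int          # destination node id
    transition: float  # P(arc | node)
    outputs: dict      # P(phone | arc)

@dataclass
class Node:
    time: float                       # point in time this node represents
    arcs: list = field(default_factory=list)

def check_graph(nodes, tol=1e-6):
    """Return a list of constraint violations (empty if none)."""
    problems = []
    for node_id, node in nodes.items():
        total = sum(a.transition for a in node.arcs)
        if node.arcs and abs(total - 1.0) > tol:
            problems.append(f"node {node_id}: transitions sum to {total:.3f}")
        for a in node.arcs:
            if a.kind == "insertion" and a.dest != node_id:
                problems.append(f"node {node_id}: insertion arc leaves the node")
            if a.kind == "AP" and a.dest == node_id:
                problems.append(f"node {node_id}: AP arc returns to its node")
            if abs(sum(a.outputs.values()) - 1.0) > tol:
                problems.append(f"node {node_id}: outputs of {a.kind} arc not normalized")
    return problems

# Toy graph: one AP arc and one insertion arc leave node 0.
graph = {
    0: Node(0.00, [Arc("AP", 1, 0.9, {"ih": 0.7, "iy": 0.3}),
                   Arc("insertion", 0, 0.1, {"ih": 0.5, "t": 0.5})]),
    1: Node(0.08),
}
violations = check_graph(graph)
```

A check of this kind could be run on every graph the AP module emits
before it is handed to lexical access.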

FINAL  CONSIDERATIONS 

PROBABILITIES  AUTOMATICALLY  ESTIMATED 

All of these probabilities should be automatically estimated from the
outputs of the AP module, so that system changes will not require
the tweaking of many parameters.

Further, higher order probabilities are desirable when there is
adequate data. Thus, for example, the deletion of a phone in context
(or in a word) for high frequency contexts (or words) would be good
information. Output statistics showing probabilities of events and
the number of events used to estimate these probabilities would be
useful to the LA group. With this information, higher order
probabilities can be used when there is enough training data to make
them reliable, and lower order probabilities can be used otherwise.
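The selection rule in the last sentence could look roughly like the
sketch below. It assumes each statistic is delivered as a
(probability, event count) pair, as the paragraph suggests; the
function name, the table layout, and the min_count threshold of 20 are
illustrative assumptions, not values from the memo.

```python
def backoff_probability(phone, context, first_order, second_order,
                        min_count=20):
    """Use the in-context estimate only if it rests on enough events."""
    entry = second_order.get((context, phone))
    if entry is not None and entry[1] >= min_count:
        return entry[0]
    # Otherwise fall back to the context-free (lower order) estimate.
    return first_order[phone][0]

# Hypothetical statistics: (probability, number of training events).
first_order = {"ih": (0.05, 4000)}
second_order = {("iy", "ih"): (0.40, 35),   # well attested context
                ("s", "ih"): (0.50, 2)}     # too few events to trust

p_after_iy = backoff_probability("ih", "iy", first_order, second_order)
p_after_s = backoff_probability("ih", "s", first_order, second_order)
p_after_t = backoff_probability("ih", "t", first_order, second_order)
```

Publishing the counts alongside the probabilities is what makes this
decision possible on the LA side.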

ALGORITHM  READJUSTMENT  WILL  BE  NECESSARY 

Although this proposed interface should ultimately improve
performance, there will be some initial problems with it. These
problems should be worked out in the context of the lexical-access
module. We propose that the AP group initially output both the
finished product graph and intermediate statistics that lead to that
graph. These include the initial locator/classifier decision
statistics, such as the confusion matrix, that lead to the ultimate
graph, and insertion and deletion statistics. Information on the
methods used to assign probabilities to new arcs should also be
presented to the lexical access group.
Abstract

This paper presents a progress report on the development of a fast
formant tracker. Our present effort is focused on extracting formants
1, 2, 3 and 4 and their amplitudes from voiced speech. Our goal is to
produce formants which can reliably drive the recognition of vowels
and semivowels in the context of speaker independent continuous
speech. The algorithm is based on peak picking and a data reduction
technique which analyzes peak frequencies and amplitudes as a
function of time. A side product of the tracker is segmentation of
voiced regions of speech that shows promise for use in segmenting
some phonemic classes. Evaluation of the tracker on 350 utterances
from the DARPA Acoustic-Phonetic database indicates that 30% of all
phonetic segments are tracked correctly. This is based on visual
evaluation of the formant tracks overlaid on spectrograms.
Phonemically, a large portion of the errors cluster around /r/. We
are in the process of adding a retroflexion detector developed at NBS
(Gengel, Majurski and Hieronymus 1987) to aid the tracker in these
areas.


Introduction 

This paper gives an overview of our current algorithm for tracking
formants in continuous speech. We want to use formants to assist in
machine recognition of vowels and semivowels. At the outset of this
project no sufficiently accurate formant tracker existed for our use.
This algorithm produces formant frequencies and amplitudes for the
first four formants.

Our goal was a formant tracker which was accurate, fast and
structured to extract as much information from continuous speech as
possible. The formant tracker employs a peak-picking algorithm
followed by a peak-combination algorithm. The peak-picking algorithm
parameterizes peaks by both their frequency and amplitude. The
peak-combination algorithm develops initial tracks, called ridges,
which are found by combining peaks which are similar in frequency and
minimally separated in time.


Front End Signal Processing

The formant tracker uses pitch synchronous DFTs as its signal
processing front end. We are currently using a pitch tracker and
synchronous DFT routines developed at CMU. During voiced speech
(areas where the pitch tracker fires) a variable width Hamming window
is centered on the pitch period. The window size is tailored to the
pitch period length. This tends to produce spectra in which the
formants are represented clearly. For peak picking this appears to be
an optimal windowing of the waveform; including a second pitch period
or overlapping two pitch periods in a single analysis window would
add components that are out of phase, thus diminishing the clarity of
the formant structure. Preemphasis of 6 dB per octave is used.

Peak  Picking 

Spectral peaks are selected by locating the negative-going
zero-crossings of the first difference of the pitch synchronous
spectra. Zero-crossings which occur during negative excursions of the
second difference mark the locations of spectral peaks. The pitch
synchronous spectra are used unsmoothed.

Each peak is parameterized by its quantized height, frame, and bin.
Within a sonorant region the range of peak amplitudes in dB is
quantized into ten levels by a simple linear function. Frame refers
to the pitch period the peak was extracted from. Bin is a frequency
measure corresponding to the DFT bin number (0-127) of the center of
the peak. Only peaks in the frequency range 0 - 4000 Hz are used in
the tracker.
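A minimal sketch of the peak picker and the ten-level quantizer
described above is given below. It is written in Python for
illustration only (the tracker itself is in C), the function names are
hypothetical, and the exact linear quantization function is an
assumption since the text does not spell it out.

```python
def pick_peaks(spectrum):
    """Return DFT bin indices of spectral peaks in one unsmoothed frame."""
    d1 = [spectrum[i + 1] - spectrum[i] for i in range(len(spectrum) - 1)]
    peaks = []
    for i in range(1, len(d1)):
        # Negative-going zero crossing of the first difference at bin i.
        negative_going = d1[i - 1] > 0 and d1[i] <= 0
        # Second difference at bin i; negative at any such crossing,
        # so this test is redundant for sampled data but kept for clarity.
        d2 = d1[i] - d1[i - 1]
        if negative_going and d2 < 0:
            peaks.append(i)
    return peaks

def quantize_heights(amplitudes_db, levels=10):
    """Map peak amplitudes in a region linearly onto levels 0..levels-1."""
    lo, hi = min(amplitudes_db), max(amplitudes_db)
    if hi == lo:
        return [levels - 1] * len(amplitudes_db)
    return [min(levels - 1, int((a - lo) / (hi - lo) * levels))
            for a in amplitudes_db]

frame_db = [0, 2, 5, 3, 1, 4, 6, 6, 2]   # toy spectrum, one value per bin
peak_bins = pick_peaks(frame_db)
heights = quantize_heights([0.0, 30.0, 60.0])
```

Each peak would then be stored as a (frame, bin, quantized height)
triple for the ridge-building stage.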

Peak  Combination  -  Simple  Ridge  Construction 
The algorithm groups together peaks that will eventually belong to
the same formant. In this first pass, the combination algorithm is
very conservative, grouping peaks that are very similar in frequency
and very close in time. For each peak in a region of voiced speech,
left and right (earlier in time and later in time) neighbors are
sought. Only peaks within 2 time frames and 2 frequency bins are
considered as neighbors. A simple distance measure is used when
multiple choices are available.
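The conservative first pass can be sketched as below. This is an
illustrative simplification: only later-in-time neighbors are linked
(earlier neighbors follow by symmetry), the distance measure is
assumed to be frame gap plus bin gap, and the function name and data
layout are hypothetical.

```python
def build_ridges(peaks, max_frames=2, max_bins=2):
    """peaks: list of distinct (frame, bin) pairs. Returns a list of ridges."""
    peaks = sorted(peaks)
    right = {}        # peak -> its chosen later-in-time neighbor
    claimed = set()   # peaks already linked as someone's neighbor
    for p in peaks:
        candidates = [q for q in peaks
                      if 0 < q[0] - p[0] <= max_frames
                      and abs(q[1] - p[1]) <= max_bins
                      and q not in claimed]
        if candidates:
            # Assumed distance: time separation plus frequency separation.
            best = min(candidates,
                       key=lambda q: (q[0] - p[0]) + abs(q[1] - p[1]))
            right[p] = best
            claimed.add(best)
    ridges = []
    for p in peaks:
        if p not in claimed:            # p has no earlier neighbor
            ridge = [p]
            while ridge[-1] in right:
                ridge.append(right[ridge[-1]])
            ridges.append(ridge)
    return ridges

peaks = [(0, 10), (1, 11), (2, 11),     # a low, steady track
         (0, 40), (1, 41), (3, 42),     # a higher track with one frame missing
         (6, 42)]                       # too far away in time to join
ridges = build_ridges(peaks)
```

The isolated peak at frame 6 ends up as a ridge of its own because it
is more than two frames from its nearest candidate.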
The matcher produces groups of linked peaks called ridges. In the
simplest case, a steady state vowel, simple ridges represent the
formants. This leaves undone only the assignment of ridges to formant
slots. More complicated speech patterns require more sophisticated
algorithms utilizing more context. A segmentation algorithm assists
here by splitting up speech in places where clear breaks in ridge
patterns occur.

Once a ridge is found, most decisions concerning it use a small
number of parameters. Its frequency is parameterized as minimum,
maximum and average frequency. Its amplitude relative to other ridges
is parameterized as strength and density. Strength is calculated by
summing the quantized height of each peak in the ridge, so it is a
function of both the relative amplitude of the ridge and the length
of the ridge. Density is calculated as the average quantized peak
height in the ridge, so it is a function of only the relative
amplitude of the ridge. In the remainder of this paper, the terms
strength and density refer to these definitions.
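The ridge parameters defined above reduce to a few sums. In the sketch
below a ridge is assumed to be a list of (frame, bin, quantized
height) triples; the function and key names are hypothetical.

```python
def ridge_parameters(ridge):
    """ridge: list of (frame, bin, quantized_height) peaks."""
    bins = [b for _, b, _ in ridge]
    heights = [h for _, _, h in ridge]
    strength = sum(heights)              # grows with amplitude and length
    return {
        "min_bin": min(bins),
        "max_bin": max(bins),
        "avg_bin": sum(bins) / len(bins),
        "strength": strength,
        "density": strength / len(ridge),  # depends on amplitude only
    }

long_weak = [(f, 20, 3) for f in range(10)]    # 10 frames, low amplitude
short_strong = [(f, 45, 9) for f in range(3)]  # 3 frames, high amplitude
weak = ridge_parameters(long_weak)
strong = ridge_parameters(short_strong)
```

The long, weak ridge has the higher strength (30 against 27) while the
short, strong ridge has the higher density (9.0 against 3.0), which is
why the two measures are used for different decisions.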

Choosing  Formants  From  Ridges 
Ridges are assigned to formant slots based on both frequency and
strength. Frequency ranges for each of the formants are determined by
speaker pitch. In voiced regions the average pitch (1/average pitch
period length) is used to declare the region as having low (below 150
Hz), high (above 170 Hz) or medium pitch. Formant ranges are then
taken from one of 3 tables. For each formant the table contains a
range of frequencies that are unique to that formant. It also
contains ranges of frequencies for which the formant choice is not
clear. A formant with frequencies in one of these ranges is
considered ambiguous and its assignment must be determined in the
context of the other formant candidates. Seven ranges exist, four for
the first four formants and three for the ambiguous areas in between.
An assignment algorithm is used to select, from the candidate ridges,
four first choice formants and four second choice formants. The
second choice formants are used in a later pass of the tracker. In
the assignor, each ridge is parameterized by its average frequency
and its strength. The top candidates (by strength) from each of the
seven ranges are collected and formant assignments made strongest
ridge first. Ridges in ambiguous frequency ranges contend for two
formant slots.

Formant Frequency Ranges (Hz) *

            Low Pitch       Medium Pitch    High Pitch
Range **    min    max      min    max      min    max
f1          100    700      140    825      175    1000
f12         700    1200     825    1250     1000   ?
f2          1200   1500     1250   1630     ?      1775
f23         1500   2500     1630   2850     ?      3000
f3          2500   2800     2850   3200     ?      ?
f34         2800   3200     3200   3600     ?      ?
f4          3200   4000     3600   4200     ?      4400

*  Uses average ridge frequency
** f1, f2, f3, f4 refer to frequency ranges unique to that formant.
   f12 refers to frequency ranges which could hold f1 or f2.
   f23 refers to frequency ranges which could hold f2 or f3.
   f34 refers to frequency ranges which could hold f3 or f4.

Figure 1 - Pitch dependent formant frequency ranges used in
formant slot assignments. Low pitch is below 150 Hz. High
pitch is above 170 Hz.
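The table lookup that precedes assignment might be sketched as
follows. Only the low-pitch column of Figure 1 is used here; the
function names are hypothetical, and treating each range as half-open
(lower bound included, upper bound excluded) is an assumption, since
the figure does not say which range owns a boundary frequency.

```python
# Low-pitch ranges from Figure 1: (label, lower Hz, upper Hz).
LOW_PITCH_RANGES = [
    ("f1", 100, 700), ("f12", 700, 1200), ("f2", 1200, 1500),
    ("f23", 1500, 2500), ("f3", 2500, 2800), ("f34", 2800, 3200),
    ("f4", 3200, 4000),
]

def pitch_class(avg_pitch_period_s):
    """Classify a voiced region by average pitch = 1 / average period."""
    pitch_hz = 1.0 / avg_pitch_period_s
    if pitch_hz < 150:
        return "low"
    if pitch_hz > 170:
        return "high"
    return "medium"

def frequency_range(avg_freq_hz, ranges=LOW_PITCH_RANGES):
    """Return the range label for a ridge's average frequency, or None."""
    for label, lo, hi in ranges:
        if lo <= avg_freq_hz < hi:
            return label
    return None

def candidate_slots(label):
    """Formant slots a ridge in this range may contend for."""
    return [int(digit) for digit in label[1:]]

region = pitch_class(0.008)          # 8 ms period, i.e. 125 Hz
label = frequency_range(900)         # falls between the f1 and f2 ranges
slots = candidate_slots(label)
```

A ridge in an unambiguous range has a single candidate slot, while a
ridge in f12, f23 or f34 has two and must be resolved against the
other candidates, strongest ridge first.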


Basic  Segmentation 

The segmenter subdivides voiced regions in a way that assists in the
assignment of ridges to formant slots. For some phoneme groups, this
corresponds roughly to a phonemic segmentation. Specifically, it
attempts to segment nasals, voiced stops, voiced fricatives, and
flaps from the surrounding voiced areas. We have found these
locations to harbor a majority of the discontinuities in formant
frequency.

The segmenter uses as input a select group of ridges called important
ridges. Ridges are selected by their dominance in a region, using
several criteria.

Ridges that occupy one of the first three formant slots (first and
second choice) are used, although in most cases the second choice
slots are empty. Other ridges may be included based on their density
measure or on a measure called overlap. The density measure is used
to include ridges which have very high amplitudes but do not have
sufficient duration to win a formant slot. The overlap provision
finds ridges which (1) overlap first choice ridges in time, (2) are
minimally separated from the first choice ridge during the overlap
and (3) have a higher density during the overlap. This overlap
provision thus finds places where the initial peak matcher may have
failed to follow the formant correctly.

Segmentation clues are extracted from the important ridges. The
beginning and end frame of each ridge is labeled as a clue. Sudden
changes in amplitude are labeled. Sudden changes in F1 frequency are
labeled.

Segmentation clues associated with F1 tend to mark obstruent
boundaries such as nasals, flaps and voiced stops. Since these
phonemic events tend to harbor formant discontinuities, a segment
boundary is created for this type of clue without further
confirmation. Segmentation clues associated with higher formants are
less reliable. They involve more variation in frequency and
amplitude. They also are more heavily influenced by neighboring
frication. For these clues, segment boundaries are only established
where two or more clues are found.
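The boundary rule in the last paragraph can be summarized in a few
lines. The sketch assumes clues arrive as (frame, formant number)
pairs and that clues "coincide" when they fall on the same frame; a
real segmenter would presumably allow some tolerance, which the text
does not specify. The function name is hypothetical.

```python
from collections import defaultdict

def segment_boundaries(clues):
    """clues: list of (frame, formant), formant 1 meaning an F1 clue."""
    by_frame = defaultdict(list)
    for frame, formant in clues:
        by_frame[frame].append(formant)
    boundaries = []
    for frame in sorted(by_frame):
        formants = by_frame[frame]
        # An F1 clue is trusted alone; higher formants need agreement.
        if 1 in formants or len(formants) >= 2:
            boundaries.append(frame)
    return boundaries

clues = [(12, 1),            # F1 clue: accepted without confirmation
         (30, 2), (30, 3),   # two higher-formant clues agree
         (45, 3)]            # single higher-formant clue: ignored
boundaries = segment_boundaries(clues)
```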
Recursive Segmentation

The segmenter drives the search for formants recursively. Each voiced
region is initially labeled as a single segment. Basic ridges and
important ridges are recomputed and new segment boundaries are
searched for. New boundaries cause the segmenter to subdivide and
operate on each resulting piece. The process ends when no further
subdivisions can be found.


Use  of  Bark  Scaled  Spectra 

An attempt was made to use Bark scaled spectra as input to the
formant tracker (Seneff 1986). We had hoped it would help in tracking
diffuse upper formants and upper formants influenced by fricatives.
While Bark scaling did help in these two cases, it hurt in ways we
were less willing to cope with. Upper formants that are close in
frequency to start with, such as in high front vowels and
retroflexion, tended to be merged. We found merges very difficult to
handle. Therefore we no longer use Bark scaling of the spectra for
formant tracking.

Performance  Analysis 

The CMU and NBS formant trackers were run on a subset of 350
sentences of the DARPA Acoustic-Phonetic database. The formants were
overlaid on spectrograms and examined. Only the voiced phonemes were
considered as valid segments for statistics. An error was counted if
any of the three formants were not within the dark bands on the
spectrogram or if the wrong formant was assigned. In total there were
6716 segments examined. The results of this analysis are shown in
Figure 2.

Figure 2 - Percent errors in voiced phonemes. Performance
on 350 sentences from the DARPA Acoustic-Phonetic
Data Base


Upcoming  Changes 

Development of the algorithm is not yet complete. Performance in
several areas can be improved. We have learned much from our first
pass at creating a segmenter based on formant data. We will soon
start a complete rewrite of the segmenter which will use many new
rules and clues. An evaluation and more complete description will be
published later. An energy based retroflex detector (Gengel, Majurski
and Hieronymus), already running in our lab, will be used to help
make decisions during /r/ and during retroflexed vowels, where F2 -
F3 merges are common. Since this algorithm is based on formant
continuity, tracking the high frequency formants during and near
fricatives is a special problem also requiring solution.

Implementation 

The peak picker and formant tracker are written in C and run on both
Unix and the Symbolics Lisp Machine using the Zeta-Soft C cross
compiler. All development work on the tracker was done on the Lisp
Machine using Spire. The Unix version has been delivered to CMU to be
evaluated for use in their system.

Execution speed on our Vax/750 averages 50 seconds per utterance. On
CMU's Vax/780 it averages 20 seconds per utterance. This
implementation uses fixed point arithmetic only.

Conclusion 

We have presented an overview of our current algorithm for tracking
formants in continuous speech. We have described the general
algorithm for tracking and for segmenting. The segmenter is currently
in a rudimentary state but we believe it holds much promise for
segmenting voiced speech. Since we believe that formant analysis is
critical to the parameterization of continuous speech, we will
continue working to improve the performance of this algorithm.
Abstract

Results are described for a "retroflexion" detector designed to
locate the acoustic manifestations of /r/, /ɝ/, and /ɚ/ in the
continuous speech of male speakers. A SEARCH analysis indicates 99
percent correct detections for a false alarm rate of 14 percent.
Missed retroflexed tokens fall into three groups: (1) low F4, which
is also close to F3; (2) "diminished retroflexion," attributed to a
phonotactic rule; (3) F2 and F3 are relatively high and straddle or
exceed the specified cutoff for male retroflexion.

Introduction 

We are developing an automatic "retroflexion" detector based on
energy ratios. It is designed to locate the acoustic manifestations
of the phonemes /r/, /ɝ/, and /ɚ/ in continuous speech. The detector
will not be described in great detail here. It is based on the idea
that a third formant below 2000 Hz for males and 2300 Hz for females
is a correlate of retroflexion. A set of energy ratios has been
formulated and tested to detect this event. In its current form, it
is designed to detect waveform segments that generally meet the
following three criteria:

1) relatively high energy between 1000-2000 Hz coupled
with relatively low energy between 2000-3000 Hz (Energy
Diff 6);

2)  relatively  high  energy  between  1400-2000  Hz  coupled 
with  relatively  low  energy  between  2000-3000  Hz  (Energy 
Diff  20); 

3)  relatively  high  energy  between  120-1100  Hz  coupled 
with  relatively  low  energy  between  2200-2800  Hz  (Energy 
Sum  4+5). 
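
The three criteria above can be sketched as band-energy contrasts. This is only an illustration, not the Energy Sum R computation itself, which the text does not spell out: the function names, the 16 kHz sampling rate, and the use of a dB difference as the "energy ratio" are all assumptions, and no attempt is made to combine the three contrasts into a single score.

```python
import math

def band_energy_db(power, sample_rate_hz, lo_hz, hi_hz):
    """Energy (dB) summed over power-spectrum bins with lo_hz <= f < hi_hz."""
    n_bins = len(power)                       # bins span 0 .. sample_rate / 2
    hz_per_bin = (sample_rate_hz / 2.0) / (n_bins - 1)
    total = sum(p for k, p in enumerate(power)
                if lo_hz <= k * hz_per_bin < hi_hz)
    return 10.0 * math.log10(total + 1e-12)   # small floor avoids log(0)

def band_contrasts(power, sample_rate_hz):
    """Low-band minus high-band energy (dB) for the three band pairs."""
    pairs = {
        "1000-2000 vs 2000-3000": ((1000, 2000), (2000, 3000)),
        "1400-2000 vs 2000-3000": ((1400, 2000), (2000, 3000)),
        "120-1100 vs 2200-2800": ((120, 1100), (2200, 2800)),
    }
    return {name: band_energy_db(power, sample_rate_hz, *low)
                  - band_energy_db(power, sample_rate_hz, *high)
            for name, (low, high) in pairs.items()}

# Hypothetical frame: strong energy below 2000 Hz, weak energy above it,
# as expected when F3 has dropped below 2000 Hz.
SAMPLE_RATE_HZ = 16000                        # assumed, not from the text
demo_power = [1.0 if k * 31.25 < 2000 else 0.001 for k in range(257)]
demo_contrasts = band_contrasts(demo_power, SAMPLE_RATE_HZ)
for name, value in demo_contrasts.items():
    print(name, round(value, 1))
```

All three contrasts are large and positive for the retroflex-like frame, which is the condition the criteria describe.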

The current version of the detector, called Energy Sum
R, was developed for male voices. A modified version of
the detector is also being developed for use with female
speakers. Only analyses based on male voices will be
reported here.

Figure 1 shows analyses of two sentences from the data
base that may be used as "canonical" representations of
/r/, /ɝ/, /ɚ/, and can be used to illustrate the underlying
logic of Energy Sum R. Note that when the criteria are
closely met, the peak in the Energy Sum R display is
relatively high. This generally occurs (1) during
production of /r/, /ɝ/, /ɚ/; (2) sometimes, during production
of certain other phonemes coarticulated with /r/, /ɝ/, /ɚ/;
and (3) sometimes, during transitions of other phonemes
(see /a/, U! and /k/).

Corpus  and  Method 

The corpus used for analysis consisted of sentences in a
subset of the DARPA Acoustic Phonetic Data Base that
are spoken by male speakers. There were 126 sentences
containing a total of 4465 tokens. The breakdown of
tokens according to number per phoneme is shown in the
upper portion of Figure 2.

In  order  to  perform  the  test  we  used  the  SEARCH  Pro¬ 
gram  developed  at  MIT  (Randolph,  1986).  The  phonetic 
labels  used  were  the  labels  provided  by  the  MIT  Group. 

The SEARCH program was set to find all labeled
phonemes which contained values of Energy Sum R
greater than a specified threshold value. Thus, included
in this search are all phonemes having Energy Sum R
values above threshold, regardless of how much or how
little the value was above threshold, or how much or
how little of the segment contained the suprathreshold
value.
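
The selection rule just described can be sketched as follows. This is not the SEARCH program itself; the function name, the frame-indexed segment format, and the sample values are hypothetical and serve only to show that a single suprathreshold frame is enough to select a labeled segment.

```python
def segments_above_threshold(segments, track, threshold):
    """Return labeled segments containing at least one value above threshold.

    segments: list of (label, first_frame, end_frame), end_frame exclusive
    track:    per-frame detector values (e.g. an Energy Sum R contour)
    """
    hits = []
    for label, first_frame, end_frame in segments:
        # One suprathreshold frame anywhere in the segment is sufficient;
        # how far above threshold, or for how long, is ignored.
        if any(v > threshold for v in track[first_frame:end_frame]):
            hits.append((label, first_frame, end_frame))
    return hits

# Hypothetical detector contour and phonetic labels.
demo_track = [5, 10, 35, 12, 8, 8, 31, 2]
demo_segments = [("r", 0, 4), ("iy", 4, 6), ("k", 6, 8)]
demo_hits = segments_above_threshold(demo_segments, demo_track, threshold=30)
print([label for label, _, _ in demo_hits])   # ['r', 'k']
```

Because the rule ignores both margin and duration, it favors detections over rejections, which is consistent with the false alarms discussed in the results.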

Results 

The results of one SEARCH analysis are shown in the
lower portion of Figure 2. Max Energy Sum 30
represents an arbitrary but reasonable threshold;
reasonable in its trade-off between correct detections and
false alarms.

Note that the analysis reduces the corpus to be examined
further to 20 percent of the original. Contained therein
are 89 percent of the target phonemes /r/, /ɝ/, /ɚ/. These
"retroflexed" phonemes comprise 32 percent of the
reduced corpus.

In the remainder of this paper, we describe some
characteristics of the /r/, /ɝ/, /ɚ/ tokens that were "missed"
by the combined detector. In a companion paper, we describe
some of the "false alarms" due to coarticulation effects
and other phenomena.

Thirty-four "retroflexed" tokens were "missed" by the
analysis: 25 /r/s, 4 /ɝ/s, and 5 /ɚ/s. These 34 tokens
were further analyzed to determine whether they
possessed characteristics different from the 279 tokens that
were correctly detected.

We call the "normal" retroflexed token one in which the
energy in F1, F2, and F3 is located below 2000 Hz, while
F4 remains above 3000 Hz. (See Figure 1.) In 15 of the
25 "missed" /r/ tokens, F4 dropped below 3000 Hz.
This relatively low-frequency F4 caused the denominators
of the energy ratios used to calculate Energy Sum R
to become large. This, in turn, resulted in low values
of Energy Sum R that did not exceed the threshold for
"retroflexion" detection. Figure 3 shows some dramatic
examples of F4 moving downward in parallel with F3.
However, the movement is not always in parallel with F3,
nor as dramatic. The characteristics of both F3 and F4
of the adjacent phoneme determine, in part, the type of
F3 and F4 movement into the retroflexed waveform
region. Presumably, these tokens show "retroflexed
alveolar articulation," as contrasted to "retroflexed
palatal articulation" (Fant, p. 28, 1973). The former are also
referred  to  as  "rx -^voiced,  continuant,  apical"  (op.cit.,  p. 
83). This phenomenon requires more detailed
investigation, since it is not restricted merely to the 15
tokens just mentioned. It also occurred in two "missed" /ɝ/
and in three "missed" /ɚ/ tokens, as well as in tokens where
Energy Sum R exceeds threshold during a portion of their
waveform duration. The latter condition can be seen
clearly in the upper portion of Figure 1, where F4
parallels the diminution in the amplitude of Energy Sum R.
(We are currently developing an F4 tracking program to
identify this phenomenon more efficiently.)

A second phenomenon, which is evident in six "missed"
tokens, we have tentatively labeled as "diminished
retroflexion." This occurred in five /r/s and in one /ɚ/.
It is evidently due to a phonotactic rule employed by
some persons living along the eastern seaboard that
states: When /r/ (or other retroflexed token) is preceded
by a vowel, delete the /r/. This rule is also used in
areas of Great Britain (Bristow, 1984) and in some
variations of Black Dialect. Acoustically, it is manifested by
a relatively high F3 (above 2000 Hz), and a relatively
low F2 (below about 1400 Hz). Perceptually, it does
seem to differ from the preceding vowel, at least to the
untrained ear. We recommend it be given a different
token-label, since it is not a retroflexed phoneme.

Finally, a third phenomenon accounts for the remaining
"missed" tokens. In these instances (five /r/s, two /ɝ/s,
and one /ɚ/), both F2 and F3 are relatively high; i.e.,
they straddle, or are above, the 2000 Hz criterion for a
male retroflexed token. These tokens are in the
frequency regions that we presume are more usual for
female voices (or higher pitched voices generally).
Nevertheless, they currently are classified as "true
misses." However, we are relatively confident they will
be detected when the modified version of Energy Sum R,
designed specifically to analyze higher pitched voices, is
fully developed.

In conclusion, the current retroflexion detector seems to
offer great promise. When the "diminished" retroflexion
tokens are deleted from the target corpus, the correct
detection rate is about 91 percent. Assuming the
successful incorporation of an F4 tracking program into the
present analysis scheme, the "low F4 misses" should also
be detected. Then the hit rate will reach about 97
percent. Finally, the modified Energy Sum R (F) detector
might detect the remaining three percent of "missed"
retroflex tokens, thereby bringing overall performance to
near 100 percent correct detection. This would be its
performance capability, tempered by a current estimated
false alarm rate of about 14 percent (572 incorrect
detections/4158 nonretroflex tokens).
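
The percentages quoted in the results and conclusion can be checked from the token counts given in the text. The sketch below only redoes that arithmetic; the variable names are ours, and the grouping of the 34 misses (20 low-F4, 6 diminished, 8 high F2/F3) is the sum of the per-phoneme counts reported above.

```python
total_tokens = 4465
detected = 279
missed = {
    "low F4": 15 + 2 + 3,                # 20 tokens
    "diminished retroflexion": 5 + 1,    # 6 tokens
    "high F2 and F3": 5 + 2 + 1,         # 8 tokens
}
false_alarms = 572

targets = detected + sum(missed.values())                    # 313
hit_rate = 100 * detected / targets                          # about 89

# Drop the "diminished" tokens from the target set.
targets_kept = targets - missed["diminished retroflexion"]   # 307
hit_rate_kept = 100 * detected / targets_kept                # about 91

# Assume an F4 tracker recovers the low-F4 misses.
hit_rate_f4 = 100 * (detected + missed["low F4"]) / targets_kept   # about 97

nonretroflex = total_tokens - targets_kept                   # 4158
false_alarm_rate = 100 * false_alarms / nonretroflex         # about 14

for value in (hit_rate, hit_rate_kept, hit_rate_f4, false_alarm_rate):
    print(round(value))
```

The denominator of 4158 nonretroflex tokens is recovered only when the six "diminished" tokens are counted as nonretroflex, which matches the recommendation that they receive a different label.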
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Abstract 

SEARCH and a retroflexion detector identified vowel
tokens that (1) were classified as retroflexed, and (2)
shared a boundary with /r/. The vowels fal, /£/, /}/,
and /I /, among others, show strong coarticulatory
effects. Data supporting this conclusion are presented.

Introduction 

A pilot study of coarticulation effects in vowels due to
post-vocalic /r/ was made for a subset of the DARPA
Acoustic Phonetic Data Base. The technique was to use
the SEARCH Program to find where the retroflexion
detector fired inside vowel tokens that shared a
boundary with /r/. We have found that the vowels /a./, /£/,
/o/, and faj are among the labeled nonretroflexed tokens
that are strongly retroflexed. (See Figure 2, Gengel, Majurski
and Hieronymus, these Proceedings.) We conclude that
strong retroflexion is indeed present in these tokens; i.e.,
F3 is below 2000 Hz for various portions of their total
duration. Histograms of coarticulation extent are
presented below.

Method 

The corpus of sentences used in this analysis was the
same as that described in Gengel, Majurski and
Hieronymus (1987, these Proceedings). However, in order to
increase sample size, additional sentences from the
DARPA Acoustic Phonetic Data Base were also included.

Figure  1  shows  the  layout  used  for  analysis.  The 
displays  are  an  Original  Waveform  Window,  a  Wide 
Band  Spectrogram,  a  Wideband  Spectral  Slice,  an  LPC 
Spectral  Slice,  and  a  Phonetic  Transcription  Window,  all 
of which are part of MIT-Spire (Cyphers, 1985); a Vax
Pidft  Spectral  Slice,  a  display  of  the  CMU-Darpa- 
System-  output;  and  F2  and  F3  formant  tracks,  a  part  of 
the  NBS  system  developed  by  Majurski  and  Hieronymus 
(1987, these Proceedings). The goal is to determine the
time in the vowel preceding /r/ where F3 drops to a
value of 2000 Hz or less. As the figure shows, there is
often good agreement among the various indicators as to
the frequency of F3. (When there is not, the value
determined by the Wideband Spectral Slice Window is
used.) When the 2000 Hz F3 pitch period has been
located, the cursor in the Original Waveform Window is
automatically aligned in the same time frame in the
Phonetic Transcription Window. We then measure to
determine whether the cursor is in the first, second, third
or fourth quartile of the vowel (or whether it is actually
in the token preceding the vowel; or alternatively,
whether it is within the /r/-token boundary itself).
Thus, for example, in the top panel of Figure 1, F3 is
below 2000 Hz in the /w/ that precedes the /£/ that
precedes the /r/; i.e., a relatively long coarticulation
effect. And in the lower panel, F3 drops below 2000 Hz
in the second quartile of the vowel preceding /r/; i.e., a
relatively shorter coarticulation effect.
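
The measurement just described can be sketched as a small classification of the F3 crossing time against the vowel boundaries. The function and its arguments are hypothetical. The numbering direction is an assumption: quartiles are counted back from the /r/ boundary so that a larger value means a longer stretch of the vowel with F3 at or below 2000 Hz.

```python
import math

def coarticulation_extent(vowel_start_s, vowel_end_s, f3_crossing_s):
    """Classify where F3 first reaches 2000 Hz relative to the vowel before /r/.

    Returns "before vowel", "within /r/", or a quartile 1..4, where 4 means
    F3 is already low for (nearly) the whole vowel.
    """
    if f3_crossing_s >= vowel_end_s:
        return "within /r/"            # no coarticulation into the vowel
    if f3_crossing_s < vowel_start_s:
        return "before vowel"          # effect reaches the preceding token
    # Fraction of the vowel lying between the crossing and the /r/ onset.
    fraction = (vowel_end_s - f3_crossing_s) / (vowel_end_s - vowel_start_s)
    return max(1, math.ceil(4 * fraction))

# Hypothetical 200 ms vowel followed by /r/.
print(coarticulation_extent(0.0, 0.2, 0.17))    # 1: short effect
print(coarticulation_extent(0.0, 0.2, 0.08))    # 3: over half the vowel
print(coarticulation_extent(0.0, 0.2, -0.05))   # before vowel
print(coarticulation_extent(0.0, 0.2, 0.25))    # within /r/
```

Counting quartiles from the vowel onset instead would simply reverse the numbering.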

Results 

A summary of the retroflexion analysis is shown in
Figure 2. Note that the duration of the vowel preceding /r/
has been divided into quartiles. As the quartile value
increases from one to four, F3 is below 2000 Hz for
longer prior to the phonemic (perceptual) onset of /r/, and
thus the duration of the coarticulatory manifestation of
retroflexion is longer. (Recall Figure 1.)

Note that most of the sampled vowels show the
coarticulation effect: 94% for fas/, *>4*? for far/, 83% for /or/,
and 88% for far/. The duration of the coarticulation
effect for all four vowels varies from relatively short
(first quartile) to relatively long. For the far/
coarticulated tokens, 61 percent of the /a/ durations are
retroflexed for over half of their total durations; for <tr/,
similarly, 41 percent of the £/ durations are retroflexed
for over half their total duration.

From the amount of data analyzed so far it is difficult to
reliably fit a Gaussian to the histogram data. Reliable
means and variances of the coarticulation durations
will therefore be determined in subsequent work.

Based on these initial findings we conclude that many of
the retroflexion "errors" signaled by the detector are not
errors but rather reflect the effect of coarticulation. For
example, further analysis of the 44 fa/ "errors" (Figure
2, op. cit.) indicates that 25 faJs preceded /r/, 6 followed
/r/, and 13 were not articulated in an /r/ environment.
The 25 pre-/r/s and the 6 post-/r/s all showed
coarticulation effects. The post-/r/ effects were small, never
beyond the first quartile. The 13 "true error" detections
have not yet been analyzed fully. However, three are
associated with /ik0/ coarticulation, the "velar pinch"
described by Zue (1988), wherein F2 and F3 in faj, and
other front vowels, "pinch" together at the /k°/
boundary.

These strong coarticulations occur across word
boundaries as well as within words. Therefore, it is
important that the effect of /r/ on nearby vowels be taken
into account in the DARPA Speech Recognition Systems.
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There  is  no  end-of-sentence 
punctuation.  Nor  is  there  any  required 
special  symbol  to  denote  silences  (either 
pre-pended,  within  the  sentence  utterance, 
or  appended)  or  to  indicate  failure  of  a 
system  to  parse  the  reference  string  or 
input speech.

Apostrophes are represented by
plus signs. Words with apostrophes (embedded
or appended) are represented as single
words. Thus "it's" becomes "IT+S".

Abbreviations become single words.
All periods indicating abbreviations are
removed and the word is closed up (e.g.
"U. S. A." becomes "USA").

Hyphenated items count as single
words. In general, compound words that do
not normally appear as separate words in
the context of the assumed task domain
model are entered as single, hyphenated
items. The exceptions to this rule are
compounds that include a geographic term,
such as STRAIT, SEA or GULF. Thus entries
such as the following count as single
"words": HONG-KONG, SAN-DIEGO, ICE-NINE,
PAC-ALERT, LAT-LON, PUGET-1, M-RATING,
C-CODE, SQQ-23, etc. However, BERING STRAIT
is to count as two words since this
compound includes the geographic term
"STRAIT", and it is not to be hyphenated.

Acronyms count as single words, and
the output representation is not the form
of the acronym made easier to interpret or
pronounce (e.g. "PACFLT", not PAC-FLEET or
PAC FLEET).
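
The apostrophe and abbreviation conventions above can be sketched as a small text normalizer. This is an illustration under our own assumptions, not the scoring software used by any site: the function names are hypothetical, hyphenated compounds and acronyms are simply passed through unchanged, and digit strings are not handled.

```python
import re

def normalize_word(token):
    """Apply the apostrophe and abbreviation conventions to one token."""
    token = token.upper()
    token = token.replace("'", "+")    # "it's" -> "IT+S"
    token = token.replace(".", "")     # "U.S.A." -> "USA"; also drops a final period
    return token.strip(",;:?!")

def normalize_sentence(text):
    """Return the list of reference-style words for a line of text."""
    # Close up spaced single-letter abbreviations ("U. S. A." -> "U.S.A.")
    # before splitting on whitespace.
    text = re.sub(r"\b([A-Za-z])\.\s*(?=[A-Za-z]\.)", r"\1.", text)
    words = [normalize_word(t) for t in text.split()]
    return [w for w in words if w]

print(normalize_sentence("It's in the U. S. A."))
print(normalize_sentence("Is Hong-Kong near the Bering Strait?"))
```

Hyphenated entries such as HONG-KONG stay as one word because only whitespace separates words, while BERING STRAIT remains two words.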


Mixed strings of alpha-numerics are
treated as acronyms. Thus, "A42128" is
treated as a one-word acronym, even though
the prompt form of this indicates that
it is to be pronounced as "A-4-2-1-2-8".
Strings of the alpha set are also treated
as acronyms (e.g. "USA"). Strings of
digits are entered in a manner that takes
into account the context in which they
appear. Thus a date such as 1987
is represented as three words: "NINETEEN"
"EIGHTY" "SEVEN". If it is referred to as
a cardinal number it would be represented
as "ONE" "THOUSAND" "NINE" "HUNDRED"
"EIGHTY" "SEVEN".

SCORING THE TEST MATERIAL

For  results  to  be  reported  at  the 
March  meeting,  the  use  of  different 
scoring  software  will  be  acceptable.  Each 
contractor  was  free  to  use  software 
consistent  with  the  following  general 
requirements : 

Data are to be reported at two
levels: sentence level and word level.


At the sentence level, a sentence is
to be reported as correctly recognized
only if all words are correctly recognized
and there are no deletion or insertion
errors (other than insertions of a word or
symbol for silence or a pause). The
percent of sentences correctly recognized
is to be reported, along with the percent
of sentences that contain (at least one)
insertion error(s), the percent of
sentences that contain (at least one)
deletion error(s) and the percent of
sentences that contain (at least one)
substitution error(s). The number to be
used for the denominator in computing
these percentages is the number of input
sentences in the relevant test subset,
without allowing for rejection of
sentences or utterances that may not parse
or for which poor scores result.

At the word level, data that are to
be reported include the percent of words
in the reference string that have been
correctly recognized. For these tests,
"correct recognition" does not require
that any criterion be satisfied with
regard to word beginning or ending times.
It is valuable, but not required, to
report the percent of insertion, deletion,
and substitution errors occurring in the
system output.
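
The sentence-level and word-level measures above can be sketched with a minimum-edit-distance alignment of the hypothesis against the reference. The text leaves the choice of scoring software open, so this is one possible reading rather than the procedure used by any contractor. The function names and the silence symbol "<SIL>" are hypothetical.

```python
SILENCE_SYMBOLS = {"<SIL>"}   # hypothetical marker; its insertion is not an error

def align_counts(ref, hyp):
    """Return (correct, substitutions, deletions, insertions) for one
    minimum-edit alignment of hyp against ref (both lists of words)."""
    rows, cols = len(ref) + 1, len(hyp) + 1
    # best[i][j] = (edits, correct, subs, dels, ins) for ref[:i] vs hyp[:j]
    best = [[None] * cols for _ in range(rows)]
    best[0][0] = (0, 0, 0, 0, 0)
    for i in range(1, rows):
        e, c, s, d, n = best[i - 1][0]
        best[i][0] = (e + 1, c, s, d + 1, n)
    for j in range(1, cols):
        e, c, s, d, n = best[0][j - 1]
        best[0][j] = (e + 1, c, s, d, n + 1)
    for i in range(1, rows):
        for j in range(1, cols):
            e, c, s, d, n = best[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (e, c + 1, s, d, n)
            else:
                diagonal = (e + 1, c, s + 1, d, n)
            e, c, s, d, n = best[i - 1][j]
            deletion = (e + 1, c, s, d + 1, n)
            e, c, s, d, n = best[i][j - 1]
            insertion = (e + 1, c, s, d, n + 1)
            best[i][j] = min(diagonal, deletion, insertion, key=lambda t: t[0])
    return best[-1][-1][1:]

def score_sentence(ref, hyp):
    """Sentence- and word-level results for one utterance."""
    hyp = [w for w in hyp if w not in SILENCE_SYMBOLS]
    correct, subs, dels, ins = align_counts(ref, hyp)
    return {
        "sentence_correct": subs == dels == ins == 0,
        "percent_words_correct": 100.0 * correct / len(ref),
        "substitutions": subs, "deletions": dels, "insertions": ins,
    }

reference = "SHOW THE BERING STRAIT".split()
print(score_sentence(reference, "SHOW <SIL> THE BERING STRAIT".split()))
print(score_sentence(reference, "SHOW BERING STRAIT".split()))
```

When several alignments have the same minimum cost, the split between substitutions, deletions, and insertions depends on tie-breaking, which is one reason results from different scoring programs may need to be compared carefully.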

For  those  systems  that  provide 
sentence  or  word  lattice  output,  scoring 
should  be  based  on  the  top-ranked  sentence 
hypothesis.  Additional  passes  through  the 
alternative  hypotheses  are  acceptable, 
provided  the  data  are  compared  with 
comparable  data  for  the  top-ranked 
hypothesis . 

System  response  timing  statistics 
should  be  reported. 

Data resulting from these tests are to
be provided to NBS following the March
meeting for detailed analysis and for use
in evaluating alternative scoring software.

DOCUMENTATION 

Documentation  on  the  characteristics 
of  the  imposed  grammar(s)  must  be 
provided.  This  information  should  describe 
any  use  of  the  material  from  which  the 
test  material  was  drawn  (i.e.  the  set  of 
2200  task  domain  sentences  developed  at 
BBN  and  used  by  TI  in  recording  the 
Resource  Management  Speech  Database). 

The  system  architecture  and  hardware 
configuration  used  for  these  tests  should 
be  documented. 
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ABSTRACT 

This paper describes considerations
in selecting test material for the March
'87 DARPA Benchmark Tests. Using a subset
of material available from the Task Domain
(Resource Management) Development Test
Set, two sets of 100 sentence utterances
were identified. For Speaker Independent
technology, 10 speakers each provide 10
test sentences. For Speaker Dependent
technology, 4 speakers each provide 25
test sentences. For "live talker" test
purposes, three 30-sentence scripts were
identified, using a total of 70 unique
sentence texts. The texts of all of these
test sentences were drawn from a set of
2200 sentences developed by BBN in
modelling the (resource management) task
domain.


INTRODUCTION 

In order to implement benchmark tests
of speech recognition systems to be
reported at the March '87 DARPA Speech
Recognition Meeting, it was necessary to
specify selected test material. This test
material is drawn from two sources: (a)
the Task Domain Speech Database recorded
at Texas Instruments (also referred to as
the "Resource Management" Database), and
(b) the use of "live talkers" in site
visits. In each case, the texts of the
sentences were drawn from a set of
sentences developed by BBN. Selection of
test material using the Resource
Management Database includes two separate
components, a Speaker Independent
component and a Speaker Dependent
component. This paper outlines the process
of defining these subsets of speech
material.

At the time the Resource Management
Speech Database was designed, it was
intended that approximately equal volumes
of material would be available for system
development (research) purposes and for
two rounds of benchmark tests.
Consequently, approximately half of the
available material is designated
"development" or "training" material, and
the remaining portion is designated for
test purposes. The test material is
designated as "Development Test" or
"Evaluation Test" sets, each including
1200 test sentence utterances in each
portion (Speaker Independent or Speaker
Dependent).

The  design  and  collection  of  this 
Task  Domain  (Resource  Management)  Speech 
Database is described elsewhere in these
Proceedings in a paper by Fisher [1].

Thus, as originally intended, two
sets of 1200 sentence utterances were to
be available for the March '87 tests.
During January 1987, discussions involving
representatives of CMU, BBN, MIT, NBS and
the DARPA Program Manager determined that
use of this large a volume of test
material was not necessary to establish
performance of current technology when
pragmatic considerations of processing
times and expected performance levels were
taken into account. Consequently, it was
agreed that subsets of 100 sentence
utterances were to be defined for these
tests, and that NBS would specify the
appropriate subset.

To complement the use of the recorded
speech database material, a test protocol
for the use of "live talkers," emulating in
some sense procedures to be used in future
demonstrations of these systems, was
defined, and texts were selected for this
purpose.


RESOURCE  MANAGEMENT  SPEECH  DATABASE  TEST 
MATERIAL 

Speaker  Independent  Test  Material 

For the March '87 tests, a set of ten
speakers was identified, drawn from
material recorded at TI and made available
to NBS in December '86 and January '87.
Each speaker provided two "dialect" and
the ten "rapid adaptation" sentences in
addition to a total of thirty test
sentence utterances. For each speaker, a
unique subset of ten sentence utterances
was specified to be used for the March
'87 tests, amounting to 100 sentence
utterances in all (10 speakers times 10
sentence utterances per speaker).

Seven male and three female speakers
were selected, reflecting the
male/female balance throughout the
Resource Management Speech Database.

To aid in the selection of individual
speakers, a set of approximately 16
speakers was identified. SRI was asked for
advice on whether any of these would be
regarded as anomalous on the basis of the
"dialect" sentences obtained in the
acoustic-phonetic database. SRI performed
a clustering analysis and advised us that
most of the speakers clustered in three
groups of similar speakers, with three
other individuals categorized as
exceptional in some sense (e.g. unusually
slow rate of speech) [2]. The ten speakers
identified for inclusion in the test
subset include one of these "exceptional"
speakers, the others being drawn from the
three clusters to provide some degree of
coverage of regional effects.

Table 1 provides detailed information
on the individual speakers' regional
backgrounds, race, year of birth and
educational level for the ten selected
speakers in the March '87 Test Subset.

Analysis, by TI, of the lexical
coverage provided by this subset of the
test material indicates that 348 words
occur at least once in this test material,
and the total number of words is 836, for
a mean length of each sentence of 8.36
words.

Speaker Dependent Test Material

For these tests, a set of four
speakers was identified, also drawn from
material recorded at TI and made available
to NBS during December '86 and January '87.
In this case, selection of the specific
individuals was strongly influenced by the
availability of training material. BBN
expressed concern that the entire set of
600 sentence utterances intended for
system training should be available for
any test speakers. At the time of
selection of test material, not all of the
12 speakers for this portion of the
database had completed recording their
training material. With this in mind, four
speakers were identified.

Each  speaker  had  previously  recorded 
the  ten  "rapid  adaptation"  and  "dialect" 
sentences,  and  the  Development  Test 
material  included  100  sentence  utterances 
for  each  speaker.  From  this,  unique  sets 
of  25  sentence  utterances  were  identified 
for  each  of  the  four  speakers,  amounting 
to  100  sentence  utterances  in  all  for  this 
portion  of  the  test  material. 

Three of the speakers were male and
one was female.

Table  2  provides  additional  data  on 
these  speakers. 

Analysis, by TI, of the lexical
coverage provided by this subset of the
test material indicates that B32 words
occur at least once, with a total number
of words of 832, for a mean sentence
length of 8.32 words. This is quite
similar to that for the Speaker
Independent material, although the details
of the distributions differ slightly.
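
The lexical coverage figures quoted for the two test subsets (distinct words, total words, mean sentence length) are simple counts over the sentence texts. The sketch below shows the computation on two made-up sentences. The function name is hypothetical and the sentences are not taken from the task domain material.

```python
def lexical_coverage(sentences):
    """Return (distinct words, total words, mean words per sentence)."""
    words = [word for sentence in sentences for word in sentence.split()]
    return len(set(words)), len(words), len(words) / len(sentences)

# Made-up sentence texts in the reference word format.
demo_sentences = ["SHOW THE BERING STRAIT", "SHOW THE GULF"]
distinct, total, mean_length = lexical_coverage(demo_sentences)
print(distinct, total, mean_length)   # 5 7 3.5

# With 100 sentences, the reported totals give the reported means directly.
print(836 / 100, 832 / 100)
```

Hyphenated compounds count once here because words are split on whitespace only, in keeping with the word conventions used for scoring.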
Table 2  Speaker Dependent Test Subset
[Table body not recoverable; legible fragments include the column headings "Race" and "Education".]


LIVE  TALKER  TEST  MATERIAL 

For the "live tests", it was
necessary to select sentence texts that
would be read by the test speakers. It was
thought desirable to use three speakers,
each speaker reading a total of 30
sentence texts in addition to the 10
"rapid adaptation" sentences. Ten of the
thirty sentence texts were to be the same


ROBUST HMM-BASED SPEECH RECOGNITION: AN UPDATE


Clifford J. Weinstein


Lincoln Laboratory, Massachusetts Institute of Technology
Lexington, Massachusetts 02173-0073

INTRODUCTION

These proceedings [2-5] provide an update on major
technical accomplishments over the past year.
This summary provides an overall introduction to
the accompanying papers, and notes some current
efforts and some additional accomplishments of the
Lincoln program in robust speech recognition.


OVERVIEW OF TECHNICAL APPROACH
TO ROBUST RECOGNITION

Our approach to achieving high-performance
recognition of speech produced under stress and in
noise has been to develop techniques for enhancing
the robustness of a baseline Hidden Markov Model
(HMM) recognizer. The training and recognition
modules of a baseline isolated-word HMM system are
depicted in Fig. 1, while Fig. 2 indicates the
robustness enhancements which have been developed
and tested. The various enhancements and their
effectiveness are described in the accompanying
papers. Many of the enhancements, such as grand
variance or temporal difference parameters, are in
the area of improved modelling and training in the
framework of the basic HMM system. Other
enhancements, such as the second-stage discriminant
analysis system, are outside the basic HMM framework.


Fig. 1. Hidden Markov Model isolated-word
recognition system.


DATA BASES OF SPEECH PRODUCED
UNDER STRESS AND IN NOISE

Two primary data bases have been used for the
isolated-word robust recognition algorithm
development work: (1) the "TI-stress" 105-word
vocabulary data base [6], including simulated stress
through talker style variation and noise
exposure (Lombard condition); and (2) the
"Lincoln-stress" data base [2], including
simulated stress via talker-style variation,
Lombard condition, and workload stress, with a
35-word vocabulary composed of acoustically
similar subsets of the TI 105-word vocabulary.

Additional experiments have been conducted on
"TI-20," a standard, normally spoken 20-word
vocabulary data base [7].


Fig. 2. HMM isolated-word recognition system
with robustness enhancement.


A major focus of our current efforts is to
extend the robust recognition work to operate with
high performance on continuously-spoken sequences
of words, under conditions of stress and noise.
Our work is directed specifically at
limited-vocabulary, restricted task domains representative
of the Pilot's Associate application of the DARPA
Strategic Computing Program. To this end, a
prototype continuous speech HMM trainer and
recognizer has been brought into operation and
subjected to initial testing. The system trains
on continuous speech, can use either subword
models or whole-word models, and includes relevant
robustness techniques used in the isolated-word
system. The continuous-speech recognition (CSR)
system is operating with good preliminary
results. The new CSR system has also been
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ABSTRACT 

Most current speech recognition systems are sensitive
to variations in speaker style. The following is the
result of an effort to make a Hidden Markov Model (HMM)
Isolated Word Recognizer (IWR) tolerant to such speech
changes caused by speaker stress. More than an
order-of-magnitude reduction of the error rate was achieved
for a 105-word simulated-stress data base and a 0% error
rate was achieved for the TI 20 isolated-word data base.

INTRODUCTION 

Current recognition systems are generally far more
sensitive to variations in speaking style and conditions
than are human listeners. Many factors can cause such
changes in speaking style in an operational environment;
for instance, typical causes are time (a week or more),
nasal congestion, emotional state, expression, and
task-induced stress. We are specifically interested in
task-induced stress, but the acoustic changes seem to be
similar for many causes. Typical effects of stress are
changes in spectral tilt, formant position, energy, timing,
and phonetic content. The following describes work which
has yielded more than an order-of-magnitude reduction in
the error rate of an HMM IWR over a multi-speech-style
data base as well as significant improvements for normally
spoken speech.

Since it is difficult to obtain large amounts of data
from stressed subjects, we have used a multi-style
simulated-stress data base generated at TI [1]. This
data base has 8 speakers (5M + 3F), a 105-word aircraft
vocabulary and, for each speaker, a (normally spoken) 5
token per word training section and 6 style sections of 2
tokens per word each for testing. The speech was digitized
with a 0 Ml audio bandwidth. The six conditions are:
normal, fast, loud, Lombard (noise presented in
headphones), soft, and shout. The shout condition is so
different from the other conditions that it has been
largely ignored. The work has focused on the other 5
conditions with their overall average substitution error
rate (avg5) as the primary measure of performance.

The observations used by this system are centisecond
mel-cepstra [2]. These mel-cepstra are computed from the
digitized speech by the following processing sequence:

1. 20 msec Hamming window.

2. 256 point FFT (-> complex spectrum).

3. Magnitude squared (-> power spectrum).

4. Preemphasis: S(f) = S(f) * (1 + (f'VKJHz)^2).

5. Triangular mel-bandpass summations (-> mel power
spectrum). Constant area filters: 100 Hz spacing
between .1 and 1 kHz, 10% above; width = 2 *
spacing.

6. Convert mel power spectrum to dB.

7. Modified cosine transform [2] (-> mel-cepstrum).

Similar mel-cepstral observations have been used
successfully in a number of recognition systems. See, for
example, [3].
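
The processing sequence above can be sketched in a few lines of
Python with NumPy. This is only an illustrative sketch, not the
authors' implementation: the 8 kHz sampling rate, the 12 cepstral
terms, and the function names are assumptions, the preemphasis step
is left out, and a plain cosine transform stands in for the
"modified" cosine transform of [2].

```python
import numpy as np

def filter_centers(f_max):
    """Center frequencies: 100 Hz steps up to 1 kHz, then 10% steps."""
    centers = list(np.arange(100.0, 1000.0 + 1e-9, 100.0))
    while centers[-1] * 1.1 < f_max:
        centers.append(centers[-1] * 1.1)
    return np.array(centers)

def mel_cepstrum(frame, fs=8000.0, n_fft=256, n_ceps=12):
    """Return c0..c_n_ceps for one 20 ms frame (fs is an assumed value)."""
    windowed = frame * np.hamming(len(frame))        # step 1
    spectrum = np.fft.rfft(windowed, n_fft)          # step 2 (zero-padded)
    power = np.abs(spectrum) ** 2                    # step 3
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    centers = filter_centers(fs / 2.0)
    spacing = np.where(centers < 1000.0, 100.0, 0.1 * centers)
    mel_power = np.empty(len(centers))
    for idx, (fc, sp) in enumerate(zip(centers, spacing)):
        # Triangle of total width 2 * spacing, normalized to constant area.
        tri = np.clip(1.0 - np.abs(freqs - fc) / sp, 0.0, None)
        tri /= tri.sum()
        mel_power[idx] = np.dot(tri, power)          # step 5
    mel_db = 10.0 * np.log10(mel_power + 1e-12)      # step 6
    n = len(mel_db)
    # Plain cosine transform (assumed stand-in for the modified one).
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps + 1),
                                    np.arange(n) + 0.5) / n)
    return basis @ mel_db                            # step 7
```

In this sketch an overall gain change moves only c0, which is
consistent with dropping the absolute energy term from the
observation vector.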

The basic system is a diagonal-covariance-matrix
multi-variate Gaussian probability density continuous-
observation [4] isolated-word HMM recognizer using the
above mel-cepstral observations. (The absolute energy
term, c0, is not used. The systems augmented with
differential observations (see below) include a
differential energy term.) Only one model is used per
word. The system is trained by the Baum-Welch
(forward-backward) algorithm. Since the data files contain
a few hundred ms of background noise at each end, the first
and last nodes of the model are dedicated to modeling the
background. (Both the training and recognition are open
endpoint.) The termination of the observations is modeled
by a transition to a degenerate node. The recognizer uses
a Viterbi decoder. All systems reported here use "linear"
networks, i.e., there are no node skip transitions.
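
As an illustration of the recognizer structure just described, the
following Python sketch scores one observation sequence against one
word model with a Viterbi pass over a linear (no-skip) network of
diagonal-covariance Gaussian nodes. It is a minimal sketch under
stated assumptions: the function names, the array layout, and the
handling of the exit transition are hypothetical, and background
nodes are not treated specially.

```python
import numpy as np

def log_gauss_diag(obs, means, variances):
    """Log densities of obs (T, D) under N diagonal Gaussians (N, D)."""
    diff = obs[:, None, :] - means[None, :, :]
    return -0.5 * np.sum(diff ** 2 / variances
                         + np.log(2.0 * np.pi * variances), axis=2)

def viterbi_linear(obs, means, variances, self_prob):
    """Best-path log likelihood through a linear model.

    The path starts in node 0, may only stay or advance by one node,
    and leaves the last node for a terminal (degenerate) node.
    """
    log_b = log_gauss_diag(obs, means, variances)
    n_frames, n_nodes = log_b.shape
    log_stay = np.log(self_prob)
    log_move = np.log1p(-self_prob)
    score = np.full(n_nodes, -np.inf)
    score[0] = log_b[0, 0]
    for t in range(1, n_frames):
        stay = score + log_stay
        move = np.concatenate(([-np.inf], score[:-1] + log_move[:-1]))
        score = np.maximum(stay, move) + log_b[t]
    return score[-1] + log_move[-1]
```

An isolated-word recognizer built this way would evaluate
`viterbi_linear` once per word model and pick the word with the
highest score.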

A variety of training conditions and HMM systems were
tested. The training conditions are "normal" training,
where only the normally spoken training section of the
data base was used for training, and "multi-style" training,
where the first token of each word of each style was added
to the training set and the second token was used for
testing. Variations in the system are of several forms:
number of nodes per word, observation enhancements, the
method of obtaining the variances, the training start
state, use of adaptive background nodes during recognition,
the duration model, and modifications to improve parameter
estimates in the face of small amounts of training data.
Unless otherwise stated, all systems have 10 active nodes.

NORMAL  TRAINING 

The normal training systems used 5 normally spoken
tokens per word for (speaker-specific) training and a total
of 1680 tokens per style (8400 per avg5) for testing. Each
system will be identified by a code. The error rates
quoted are the avg5 percent error. More detail will be
found in Table I and Figure 1.

The baseline system used trained model variances
(baseline,10: 20.5%). Lower bounding the variances
(variance limiting), which reduced the effects of limited
training data, reduced the error rate to (v1,10: 15.92%).
Augmenting the observations with 20 ms temporal differences
of the mel-cepstra further reduced the error rate to
(v1,d2,10: 10.50%).
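
The two enhancements in this paragraph, variance limiting and
temporal differences, are simple array operations. The Python
sketch below shows one way they might look; the function names, the
floor value, and the choice of a backward difference over two
centisecond frames (20 ms) are assumptions made for illustration,
not details given in the text.

```python
import numpy as np

def limit_variances(variances, floor):
    """Lower-bound every trained variance at `floor` (assumed value)."""
    return np.maximum(variances, floor)

def add_differences(ceps, lag=2):
    """Append temporal differences to centisecond frames of shape (T, D).

    With 10 ms frames, lag=2 gives a 20 ms difference. The first `lag`
    frames have no predecessor and get a zero difference (an assumption).
    """
    delta = np.zeros_like(ceps)
    delta[lag:] = ceps[lag:] - ceps[:-lag]
    return np.hstack([ceps, delta])
```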

The above systems all used individually trained
variance vectors for each node. The following systems use
the same variance vector for all nodes. An L2 norm (all
variances = 1) performed poorly (l2,10: 13.58%). Fixed
variances equal to the variances of all the training speech
of all speakers (Fig. 2) yielded (raw,10: 8.76%). A
perceptually-based fixed variance (see below and Fig. 2)
reduced the error rate to (gfv,10: 6.13%). Adding 20 ms
temporal differential parameters (gfv,d2,10: 6.99%),
increasing the number of nodes to 14 (gfv,d2,14: 2.56%),
modifying the fixed variance (gafv,d2,14: 2.26%),
and finally, adding an endpoint-based training start state
and adaptive background estimation to the recognizer
reduced the error rate to (gafv,d2,b,14: 1.95%). A trained
"grand variance" (Fig. 2) computed in the Baum-Welch
reestimation procedure, in which the variance is tied over
all nodes of all words, approached the fixed variance
(grandv,d2,b,14: 2.95%).

MULTI-STYLE  TRAINING 

The multi-style trained systems added 1 token from
each non-shout test style to the training set for 10 tokens
per word (mtaa). (Similar results were obtained even when
a shout token was included in the training.) This left 840
test tokens per style and 4200 tokens for the avg5. None
of the test tokens were used for training.

In general, the results improved significantly:

normally trained               multi-style trained

v1,d2,b,14:       7.71%        mtaa,v1,d2,b,14:       1.12%
gafv,d2,b,14:     1.95%        mtaa,gafv,d2,b,14:      .93%
grandv,d2,b,14:   2.95%        mtaa,grandv,d2,b,14:    .88%

Multi-style training, by presenting more legitimate
variation to the training algorithm than does the normal
training, appears to cause the training to find a better
model for the word [5]. In addition, performance on the
normal style usually improves as a result of the
multi-style training.

THE  PERCEPTUALLY  MOTIVATED  FIXED  VARIANCE 

Vowel perception experiments [6] have indicated that
formant position is more important than spectral tilt in
vowel identification. Our investigations also showed large
spectral tilts to be one of the effects of speaker stress
and style [7]. However, spectral tilt is also a strong cue
for distinguishing between voiced and unvoiced speech.
Therefore, a weighting (inverse variance) which
deemphasized, but did not totally eliminate the low-order
mel-cepstral terms, was applied in the distance measure
(i.e., was used to postulate a fixed variance in the
Gaussian probability density function). The weighting
function was chosen to be (Fig. 3):


1<i<d  (normal mel cepstra)

0<i<dd  (differential mel cepstra)
otherwise


var[i] = 1/w[i]


where i is the mel-cepstral index and d and dd are the
observation orders.


This weighting, if interpreted as a signal processing
operation on the mel-spectrum, is a dynamic range
compressor and local feature enhancer very similar to the
homomorphic dynamic range compression and contrast
enhancement techniques used in picture processing [8]. It
also provides a degree of tolerance to changes in the audio
channel (see below).


STRONGER DURATION MODELS


The Ferguson full duration model [9], which models the
duration as a vector of duration probabilities rather than
the dying exponential of the standard HMM, yielded mixed
results when applied to some of the above systems. It is
much more computationally intensive than the standard
systems and has not been adequately explored. It also
appears to require more training data than was available
for these experiments.

A double-node subnet is much simpler and
computationally more efficient than the full duration
model. Each node is replaced with a network consisting of
2 series-connected nodes constrained to have the same
observation probability density functions and the same
self-transition probabilities. With no increase in the
number of parameters and only a minor increase in the total
computation (the computational load is dominated by the
probability density functions), the duration model is
changed from the usual exp(-at) to a t*exp(-at) form. This
system has not been adequately explored, but the results
are encouraging: from (gafv,d2,10: 3.69%) to
(dn,gafv,d2,10: 3.62%). Similar duration models have been
examined elsewhere [10].
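
The change in the implied duration model can be checked
numerically. The short Python sketch below (function names are
illustrative assumptions) computes the discrete dwell-time
distribution of a single node with self-transition probability p,
and of two tied nodes in series, whose total dwell time is the sum
of two independent single-node dwell times.

```python
import numpy as np

def single_node_duration(p, t_max):
    """P(dwell = t), t = 1..t_max, for one node: a dying exponential."""
    t = np.arange(1, t_max + 1)
    return (1.0 - p) * p ** (t - 1)

def double_node_duration(p, t_max):
    """P(dwell = t), t = 1..t_max, for two tied nodes in series."""
    single = single_node_duration(p, t_max)
    # Convolving the two dwell-time distributions gives durations 2, 3, ...
    conv = np.convolve(single, single)[: t_max - 1]
    return np.concatenate(([0.0], conv))  # a duration of 1 is impossible
```

The convolution equals (t-1) p^(t-2) (1-p)^2, the discrete
counterpart of the t*exp(-at) form: the most likely duration is no
longer the shortest one.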


ADDITIONAL  RESULTS 


This system, which was developed using the TI
simulated-stress data base, has been tested on two other
data bases and in numerous live-input demonstrations. The
system has shown similar results on a locally-generated,
simulated, and workload stress data base [3,11]. The
system has also been tested using the TI 20 isolated-word
data base (16 speakers, 20-word vocabulary) [12]. The
error rates were as follows:


First test:

ti-iwd: gfv,d2,10: .06% (3/5120)

Best result:

ti-iwd: gfvx,d2: .00% (0/5120)

The best result reported in [12] is .20% (10/5120).

Live-input tests of the normally trained, fixed-
variance system have confirmed the robustness to speech
style and have indicated additional tolerances. This
system has been trained over a local dialed-up telephone
line and tested over dialed-up long-distance telephone
lines. A test was performed when the speaker had a cold.
(The training data had been recorded months earlier.)
Under both of these conditions, the system has continued to
perform well. With the exception of the adaptive
background nodes, the system did not adapt in any way to
the test environment.

The system has been formally tested or demonstrated
over bandwidths ranging from 3.2 kHz to 8 kHz. (In each
case the testing and training bandwidths were identical.)

A variety of microphones, audio systems, and background
noise levels have been used and the system has tolerated
them, all within reasonable limits.

DISCUSSION 

The improvements reported here are the result of
several philosophies: the model must be trainable, must
have sufficiently detailed observations, and must be
tolerant of unanticipated changes. Any parameters used
must be trainable on realistically available amounts of
data. Thus, the variance limiting, grand variance, and
fixed-variance systems outperformed the baseline system.
Augmenting the observations with time-differential
parameters provided more information to the systems and
thus yielded further improvements. Hidden Markov models
are fairly insensitive to the exact number of nodes, but
the increase from 10 to 14 nodes generally improved their
ability to model the given vocabulary.

A "fully trained" model can only model the training
data; it cannot actively anticipate what it has not seen.
We provide a priori information to these systems in several
ways. The number of nodes and the allowable transitions
between them are one form of such a priori information.
The fixed variance is another way of providing useful a
priori knowledge. The normally trained grand variance
system only knows about speech variations found in its
training data. The fixed variance informs the system about
the kinds of variations which may be encountered in speech
and, therefore, outperforms the grand variance for normal
training with style testing, and gives equivalent
performance for normal training with normal testing and
multi-style training with style testing. From another
viewpoint, the application of a priori knowledge has
reduced the requirements for training data by anticipating
the variation in the test data.

The techniques described here increase the tolerance
of the recognizer to speech variations. Another approach
used a fixed variance, normally trained system, modified to
compensate for the spectral tilts during recognition [13].

CONCLUSIONS 

Several techniques for improving the training and
speech modeling of a "textbook" (baseline) HMM recognizer
with good normal speech performance have been combined to
significantly improve recognition results in the face of
speech-style variation and small amounts of training data.
Results have improved from a 20.5% avg5 error rate to 1.95%
if normal training data is used, or to .88% if samples of
the expected speech styles are available. These
enhancements have also improved performance on normal
speech, as shown by the .24% error rate achieved for normal
speech, and the .06% and .00% error rates achieved on the
TI 20 isolated-word data base.

TABLE I

SUBSTITUTION ERRORS FOR TI-SIMULATED STRESS AND TI-IWD DATA BASES

[Table body not recoverable from the scan. Surviving entries
under "Normal Training: individual node variances": 15.92;
v1,d2,10; v1,d2,b,14; .48; 7.71; 2.08.]

Fig. 3. The perceptually-motivated weighting.

ABSTRACT

Automatic speech recognition algorithms generally rely
on the assumption that for the distance measure used,
intraword variabilities are smaller than interword
variabilities so that appropriate separation in the
measurement space is possible. As evidenced by degradation
of recognition performance, the validity of such an
assumption decreases from simple tasks to complex tasks,
from cooperative talkers to casual talkers, and from
laboratory talking environments to practical talking
environments.

This paper presents a study of talker-stress-induced
intraword variability, and an algorithm that compensates
for the systematic changes observed. The study is based on
Hidden Markov Models trained by speech tokens in various
talking styles. The talking styles include normal speech,
fast speech, loud speech, soft speech, and talking with
noise injected through earphones; the styles are designed
to simulate speech produced under real stressful
conditions.

Cepstral coefficients are used as the parameters in
the Hidden Markov Models. The stress compensation
algorithm compensates for the variations in the cepstral
coefficients in a hypothesis-driven manner. The functional
form of the compensation is shown to correspond to the
equalization of spectral tilts.

Preliminary experiments indicate that a substantial
reduction in recognition error rate can be achieved with
relatively little increase in computation and storage
requirements.

INTRODUCTION 

Current speech recognition systems generally degrade
significantly in performance if the systems are not both
trained and tested under similar talking conditions. A
major reason for performance degradation when testing and
training conditions differ is that people speak differently
under different conditions. Despite the knowledge that
speech patterns change in stress and in noise, little
speech recognition research has been directed at modeling
systematic changes observed and at developing recognition
systems that are resistant to such changes.

This paper presents a study of talker-stress-induced
variations in speech cepstral coefficients, and an
algorithm that compensates for systematic (but unknown)
changes observed. The study is based on an isolated-word
Hidden Markov Model speech recognizer [1] trained by speech
spoken in various talking conditions. The recognizer [1]
is a continuous-observation HMM system using mel-frequency
cepstral parameters. The work reported in this paper is
described in more detail in [2].

The experiments conducted in this research were based
on the "simulated stress" [3] speech data base collected by
Texas Instruments.

In this data base, stress-like degradations of the
speech signal were elicited by asking the speaker to
produce speech in a variety of styles (normal, fast, loud,
soft, and shout) as well as with 95-dB pink noise exposure
in the ear to produce the Lombard effect. The vocabulary
consisted of 105 words, including monosyllabic,
polysyllabic, and confusing words.

The data base was divided into training data and test
data. Training data consisted of five samples of each of
the 105 words collected in a random order under normal
talking conditions, and test data consisted of two samples
of each word under each simulated-stress condition. Data
were collected from five adult males and three adult
females. The total number of test word tokens was 10,080.

AN EXPERIMENT ON MULTISTYLE-TRAINED
HIDDEN MARKOV WORD MODELS

Multistyle training [4,5] is a technique used to
improve speech recognition performance under stress. In
multistyle training a recognizer is trained using word
tokens spoken with different talking styles instead of
using words all spoken normally. It has been found to be
easy for a talker to change to styles such as fast, slow,
loud, and soft, producing changes in speech characteristics
that are similar to changes that occur under stress.

An experiment on multistyle-trained Hidden Markov
Model word recognition was performed. In this experiment,
11 speech tokens were used to train each word model: 5
tokens from the training data base, and 6 tokens, one per
talking style except normal, from the test data base. The
recognition error rates are listed in Table I. For
comparison, the error rate of the baseline HMM system is
also included.

From Table I we observe that there is dramatic
performance degradation when the baseline recognizer is
tested with styled data, and that multistyle training has
considerably improved system performance for speech data of
all styles.

It appears that the HMM word models were able to
assimilate the data from the multiple styles and to capture
statistically the invariant features of each word. In
the next section we investigate the gross changes of model
parameters resulting from multistyle training as well as
from style training (as opposed to normal training).

CEPSTRAL DOMAIN STRESS COMPENSATION - DRIVEN BY OBSERVATIONS

The success of the multistyle training experiment
motivated a comparison of the model parameters trained
under various talking styles to determine whether it would
be possible to compensate for the cepstral changes through
simple transformations on the cepstral means and variances
obtained using normal training. Such a transformation, if
effective, would simplify the training procedure.

The differences among normally trained, single-
style-trained, and multistyle-trained word models are
partially reflected in the average shifts of the mean
values and in the average scaling of the variances of the
cepstral coefficients. To study such differences, seven
different sets of word models were examined. Six of the
models were trained under six individual conditions
(normal, fast, loud, Lombard, soft, and shout,
respectively) while the seventh was trained using a
composite of all these conditions (multi-style). The
cepstral means and variances, averaged over all words in
the TI vocabulary, over all speech nodes in each word, and
over all talkers, were computed for each of the models
above.

The mean cepstral shifts (i.e., cepstral means of the
given model minus the cepstral means of the normal model)
for each of the cepstral coefficients are plotted in Fig.
1. Figure 1(a) plots mean cepstral shifts for four cases:
soft; shout; average of fast, loud and Lombard; and
multistyle. Figure 1(b) plots the corresponding spectra of
these mean shifts, contrasting the effects of spectral tilt
of low vocal effort (soft) vs higher vocal effort (fast,
loud, Lombard, and shout). Increased vocal effort
increases the relative high frequency content, whereas the
opposite occurs with low vocal effort.

It is well known that spectral tilt exhibits large
variation when a talker speaks under stress. Such
variation usually contaminates the distance measure and is
one of the most significant causes of recognition
performance degradation. It appears that the effect of
spectral tilt could be compensated, to some extent, by
applying the appropriate cepstral compensation to normally
trained word models.

Because variance estimation is less reliable than mean
estimation, we have only compared cepstral variances of
multistyle-trained models which used 11 training tokens
with the normally trained models. Their ratios
(multistyle/normal) are plotted in Fig. 1(c). It appears
that the major style-induced variations occur in the most
slowly varying spectral components (corresponding to lower
order cepstral coefficients), and in most rapidly varying
spectral components (corresponding to the higher order
coefficients) (see footnote 1).


or more sets of cepstral differences. The word models were
talker-dependent, but the modifications were the same for
all words and all talkers.

(a) Single Model Compensation: The set of cepstral
mean differences and variance ratios observed in
multistyle-trained models [represented by filled
squares in Fig. 1(a) and (c)] was applied as
compensation in recognition tests on all styles.

(b) Multimodel Compensation: Three sets of cepstral
mean compensations corresponding to the soft, the
loud, and the shout-trained models, were applied
to generate three new word models. The variances
in these models were scaled according to
Fig. 1(c). In recognition, the four models
(including the original normal model) were
treated independently and equally; in effect, the
computation for HMM recognition was quadrupled.
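
Both compensation schemes amount to shifting the cepstral means and
scaling the cepstral variances of a normally trained word model. The
Python sketch below illustrates the idea; the function names and the
array shapes (nodes by cepstral coefficients) are assumptions made
for illustration, and the shift and ratio vectors would come from
measurements such as those plotted in Fig. 1.

```python
import numpy as np

def compensate_model(means, variances, mean_shift, var_ratio):
    """Return a new word model with shifted means and scaled variances.

    means, variances: (N, D) arrays for one normally trained word model.
    mean_shift, var_ratio: (D,) vectors shared by all words and talkers.
    """
    return means + mean_shift, variances * var_ratio

def multimodel_set(means, variances, shifts, var_ratio):
    """Original model plus one compensated model per shift vector."""
    models = [(means, variances)]
    for shift in shifts:
        models.append(compensate_model(means, variances, shift, var_ratio))
    return models
```

With three shift vectors (soft, loud, shout) the set holds four
models per word, which is why the recognition computation is
quadrupled.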

The recognition error rates of these experiments are
listed in Table II. The error rate reductions relative to
the baseline system seem quite promising given the
simplicity of the compensation technique.

The next section discusses a variation of the above
technique: the hypothesis-driven stress compensation.

CEPSTRAL DOMAIN STRESS COMPENSATION -
A HYPOTHESIS DRIVEN APPROACH

It is the high cost of increased computation and the
uncertainty about training-style sufficiency and efficiency
that prompted us to search for alternatives. As a result
of this effort, the hypothesis-driven cepstral mean
compensation technique, which adapts to the input speech
and to the hypothesized reference word, was developed.
Fixed multistyle variance compensation has been found
beneficial for all styles and will be used in conjunction
with the adaptive mean compensation.

In deriving this technique, we model the talker as an
information source (Fig. 2). It puts out a sequence of
deterministic cepstral vectors {c_t} (see footnote 2).
Before the vectors are received by the decoder, we assume
that they undergo two stages of contamination.

Stage 1

A sequence of independent identically distributed
(i.i.d.) random vectors {s_t} is added to the cepstral
sequence {c_t} to create a new sequence {u_t}:

    u_t = c_t + s_t                            (1)

The sequence {s_t} models the randomness of speech
cepstral parameter outputs; its elements are assumed to be
normally distributed with zero mean vector and diagonal
covariance matrix.


The following cepstral compensation experiments were
performed, in which new word models were generated by
modifying normally trained Hidden Markov word models by one

1 In a different data base we have observed inflated
variance scaling only in the low order coefficients.

2 The subscript t is an index of time.

Stage 2

A deterministic but unknown vector χ is added to
the sequence {u_t} to create the observation sequence
{v_t}:

    v_t = u_t + χ                              (2)

The vector χ is the additive "stress" component. It
is assumed to have the functional form [see Fig. 1(a)]:

    χ_i = a e^{-b(i-1)}                        (3)

and is further assumed to remain unchanged within a word
interval.

[Equations (7) and (8) not recoverable from the scan.]

then T_n and Y_n are independent and the expectation can
be approximated by

Given a sequence of observations v_t, t=1,2,...,T, we
have developed a procedure for estimation, based on maximum
likelihood principles, of the parameters a and b in
Equation (3).

The procedure consists of two steps:

Step 1 (Estimating χ_i)

The probability density function of v_it, the ith
component of the observation vector, is given by

    f(v_it) = (1 / (sqrt(2π) σ_i))
              exp( -(v_it - c_it - χ_i)^2 / (2 σ_i^2) )    (4)

where c_it and χ_i are the ith components of the cepstral
vector and the "stress" vector, respectively.

Given a set of independent observations {v_it},
t=1,...,T, the maximum likelihood estimate of χ_i is given
by

    χ_i = (1/T) Σ_{t=1}^{T} (v_it - c_it)                (5)

with the means and variances given by

We replace the sample average of c_it, which is not
observable, by the expected average value, derived from the
word hypothesis:

The estimation formula (5) becomes

[Equation (11) not recoverable from the scan; it contains
one sum starting at t=1 and one sum starting at n=1.]

In Equation (11) the first sum is over the observed
cepstral coefficient sequence, and the second sum is over
the nodes of the hypothesized model. Therefore, we refer
to this technique as a hypothesis-driven technique.

Step 2 (Smoothing χ_i)


where the T_n's are a set of mutually independent discrete
random variables whose values represent the dwell time in
each of the n nodes, and the summations are over all speech
nodes.


Since a closed formula for the expectation has not been
found, we use an approximation using up to the second-order
moments. Let

After χ_1, ..., χ_12 are estimated, we fit Equation (3) to
them. A least-mean-square fit requires numerically solving
a set of nonlinear equations. A less computationally
intensive and yet more robust fit (i.e., one which is less
susceptible to the effect of outlying data) is given by
fitting exponential functions to all pairs (χ_i, χ_j), i≠j,
or a subset of these pairs, and then by averaging magnitudes
and time constants of the fits. We have chosen to fit the
pairs that contain χ_1 and one of χ_2, χ_3, χ_4 and χ_5;
namely, (χ_1, χ_j), j=2,3,4,5. Therefore,

    a_j = χ_1,  b_j = (1/(j-1)) ln(χ_1/χ_j),
                        if χ_1 χ_j > 0 and |χ_1| > |χ_j|
    a_j = 0,    b_j = 0,    otherwise,               (12)

a and b are the averages of the non-zero a_j's and b_j's.

Given the cepstral vectors of a test token and the
Hidden Markov word model for a reference, the procedure for
the adaptive cepstral compensation and recognition is
described as follows:

Step 1: Compute a set of stress components [c.f. Eq. (11)].

Step 2: Smooth the stress components by fitting an
exponential function to them [c.f. Eqs. (3) and (12)].

Step 3: Subtract the values of the exponential function
from the cepstral vectors of the test token.


Step 4: In recognition, perform likelihood tests using the
compensated test tokens.
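
Steps 1 to 3 can be sketched in Python as follows. This is a hedged
illustration rather than the authors' procedure: the function names
are hypothetical, the expected average of the clean cepstra is taken
as a weighted average of the hypothesized model's node means with
uniform weights by default (the paper's dwell-time weighting is not
reproduced here), and returning zero compensation when no pair
passes the sign and magnitude test is an assumption.

```python
import numpy as np

def estimate_stress(obs, node_means, node_weights=None):
    """Step 1: observed cepstral average minus the model's expected average.

    obs: (T, D) test-token cepstra; node_means: (N, D) hypothesized model.
    node_weights: assumed fraction of time spent in each node.
    """
    n_nodes = node_means.shape[0]
    if node_weights is None:
        node_weights = np.full(n_nodes, 1.0 / n_nodes)  # assumption
    return obs.mean(axis=0) - node_weights @ node_means

def smooth_stress(chi, pair_indices=(2, 3, 4, 5)):
    """Step 2: fit a * exp(-b (i - 1)) using pairs (chi_1, chi_j)."""
    a_vals, b_vals = [], []
    for j in pair_indices:                  # 1-based cepstral index
        x1, xj = chi[0], chi[j - 1]
        if x1 * xj > 0 and abs(x1) > abs(xj):
            a_vals.append(x1)
            b_vals.append(np.log(x1 / xj) / (j - 1))
    if not a_vals:
        return np.zeros_like(chi)           # assumed fallback
    a, b = np.mean(a_vals), np.mean(b_vals)
    i = np.arange(1, len(chi) + 1)
    return a * np.exp(-b * (i - 1))

def compensate(obs, node_means):
    """Step 3: subtract the smoothed stress vector from every frame."""
    return obs - smooth_stress(estimate_stress(obs, node_means))
```

Step 4 would then pass the output of `compensate` to the likelihood
computation for the same hypothesized word, repeating the whole
procedure for each candidate word.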


In Table III we summarize the recognition error rates
when the hypothesis-driven stress compensation is applied
to the "simulated stress" data base. For comparison, the
error rates of the baseline and of multimodel compensation
are also included. This technique has also been applied to
a more advanced 14-node, fixed-variance HMM system [1]
whose parameters contain cepstral coefficients as well as
differential cepstral coefficients. Because cepstral
variances are fixed in this recognizer, no variance scaling
is performed. The recognition results, with and without
cepstral compensations, are listed in Table IV.


A confidence interval analysis indicates that the
improved error rates in Tables III and IV, 6.2% and 1.9%,
lie well outside the 95% confidence intervals of the
unimproved error rates, 13.9% and 2.5%. Therefore, our
experimental results are statistically significant.
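
A check of this kind can be sketched with a normal-approximation
binomial confidence interval. The paper does not state how its
intervals were computed or how many test tokens underlie each rate,
so the interval formula and the token count used in the usage note
are assumptions for illustration only.

```python
import math

def binomial_ci(error_rate, n_tokens, z=1.96):
    """Approximate 95% confidence interval for an error rate.

    Uses the normal approximation to the binomial; n_tokens is the
    number of test tokens behind the rate (an assumed input).
    """
    half_width = z * math.sqrt(error_rate * (1.0 - error_rate) / n_tokens)
    return error_rate - half_width, error_rate + half_width
```

For example, with a hypothetical 8400 test tokens,
`binomial_ci(0.139, 8400)` gives an interval whose lower end is far
above 0.062, and `binomial_ci(0.025, 8400)` gives one whose lower end
is above 0.019.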

CONCLUSION 

Spectral tilt has been found to vary significantly for
speech spoken in stressful talking environments. We
studied the statistical variations of cepstral coefficients
embedded in the framework of Hidden Markov models and [...]
that the observed changes in cepstral mean values, from
normal [...] exponential type of spectral tilt. A simple and
efficient compensation technique, the hypothesis-[...]
recognition experiments yielded significant reduction in
error rate.

ABSTRACT

A new training procedure called multi-style
training has been developed to improve performance
when a recognizer is used under stress or in high
noise but cannot be trained in these conditions.
Instead of speaking normally during training,
talkers use different, easily produced, talking
styles. This technique was tested using a speech
data base that included stress speech produced
during a workload task and when intense noise was
presented through earphones. A continuous-
distribution talker-dependent Hidden Markov Model
(HMM) recognizer was trained both normally (5
normally spoken tokens) and with multi-style
training (one token each from normal, fast, clear,
loud, and question-pitch talking styles). The
average error rate under stress and normal
conditions fell by more than a factor of two with
multi-style training and the average error rate
under conditions sampled during training fell by a
factor of four.

INTRODUCTION 

The performance of current recognition
systems often degrades dramatically as a talker's
speech characteristics change with time, when a
talker is under normal levels of workload or
psychological stress, and when a talker is in a
high noise environment. New techniques to prevent
this degradation have been developed and tested
with a number of data bases, including a new
Lincoln stress-speech data base. In this paper we
first review results obtained with this speech
data base and then provide detailed information on
the effects of multi-style training. Other papers
in this proceedings describe discriminant analysis
[1] and cepstral stress compensation [2] and
present results obtained with another speech data
base [3].

Lincoln Stress-Speech Data Base

The Lincoln stress-speech data base includes
words spoken with eight talking styles (normal,
slow, fast, soft, loud, clear enunciation, angry,
question pitch) and under three stress conditions.
A difficult motor-workload task [4] was used to
create easy (cond50) and more difficult (cond70)
workload stress conditions that simulate the type
of workload stress experienced when driving a car
or flying an airplane. A third stress condition
was created by presenting 85 dB SPL of
speech-shaped noise through earphones. This produces the
so-called Lombard effect [0] where a talker speaks
louder and often more clearly when in noise. This
is the main cause for recognizer degradation in
noise in situations where an acoustically-shielded
close-talking microphone minimizes the effect of
additive noise. The data base vocabulary
contained 35 difficult aircraft words with
acoustically similar subsets such as go, hello,
oh, no, and zero. A total of 11,340 tokens were
obtained from 9 male talkers during three sessions
per talker spanning a four week period.

HMM  Recognizer 

The baseline continuous-distribution HMM
recognizer described in [4] was used for all
experiments. It is a left-to-right isolated-word
recognizer with multivariate Gaussian
distributions and diagonal covariance matrices where
observations consist of centisecond mel-scale
cepstral parameters. Unless otherwise stated, all
results were obtained using 10-node word models
created using five training tokens per word with
the forward-backward algorithm [5] and using the
Viterbi algorithm [5] during recognition.
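
The recognition step for one word model of this type can be sketched as below. This is a minimal illustration rather than the recognizer of [4]: the function names are our own, the equal stay/advance transition probabilities are placeholders, and the model is assumed to start in its first node and end in its last.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of vector x under a Gaussian with diagonal covariance."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def viterbi_left_to_right(frames, means, variances,
                          log_stay=math.log(0.5), log_next=math.log(0.5)):
    """Best-path log score and node segmentation for one word model.

    frames:    list of observation vectors (one per frame)
    means:     per-node mean vectors
    variances: per-node variance vectors (diagonal covariance)
    The token must have at least as many frames as the model has nodes.
    """
    n_nodes, n_frames = len(means), len(frames)
    neg_inf = float("-inf")
    score = [[neg_inf] * n_nodes for _ in range(n_frames)]
    back = [[0] * n_nodes for _ in range(n_frames)]
    score[0][0] = log_gauss_diag(frames[0], means[0], variances[0])
    for t in range(1, n_frames):
        for j in range(n_nodes):
            stay = score[t - 1][j] + log_stay
            move = score[t - 1][j - 1] + log_next if j > 0 else neg_inf
            if move > stay:
                best, back[t][j] = move, j - 1
            else:
                best, back[t][j] = stay, j
            if best > neg_inf:
                score[t][j] = best + log_gauss_diag(frames[t], means[j],
                                                    variances[j])
    # Trace back from the final node to recover the node of every frame.
    path = [n_nodes - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return score[-1][-1], path
```

In an isolated-word recognizer of this kind, the function would be called once per word model and the word with the highest score chosen; the returned path is the segmentation that assigns each frame to a node.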

RESULTS WITH LINCOLN STRESS-SPEECH DATA BASE

Figure 1 presents an overview of results in
rough chronological order obtained using a number
of different techniques with the Lincoln
stress-speech data base. The initial error rate,
averaged over all conditions excluding the most
difficult angry condition, was 17.5%. A similar
high error rate was obtained with a new, high
performance, commercial recognizer. Poor
performance for the initial Lincoln system and the
commercial system was caused by the difficult
vocabulary and stress conditions and by the fact
that only normally-spoken speech was used in
training. The initial Lincoln recognizer was the
baseline system with variance limiting [4] which
limits the variance estimates obtained during
forward-backward training to be above a specified
lower limit. The high initial error rate was more
than halved to 6.9% using multi-style training. In
this case, the five tokens used during training
were taken from the normal, fast, clear, loud, and
question-pitch talking styles instead of only from
the normal style. Multi-style training halved the
error rate with no increase in computation
requirements.
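
Variance limiting as described above amounts to flooring each variance estimate. A minimal sketch follows; the function name and the floor value are hypothetical, since the limit actually used is given in [4] and not here.

```python
def limit_variances(variances, floor):
    """Raise every variance estimate below `floor` up to `floor`.

    variances: per-node lists of per-parameter variance estimates
    floor:     lower limit (hypothetical value, chosen by the user)
    """
    return [[max(v, floor) for v in node] for node in variances]

# Two nodes, two parameters each; 0.01 is an illustrative floor only.
print(limit_variances([[0.001, 2.0], [0.5, 0.0]], floor=0.01))
```

Flooring keeps a node trained on a handful of nearly identical tokens from assigning vanishing likelihood to slightly different test tokens.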

The next large reduction in error rate (from
6.9% to 3.2%) was obtained by doubling the number
of parameters used in the observation vector. The
original vector of 16 cepstral parameters was
supplemented with 16 additional differential
parameters which were the differences between the
current 16 parameters and the parameters computed
20 ms earlier. This differential parameter
technique was also recently used by [6]. It
reduces the error rate, but also doubles the
recognition computation requirements. The next
large decrease in error rate (from 3.2% to 1.6%)
was obtained by using grand-variance estimates.
Instead of estimating the variance of each of the
32 observation parameters separately for each node
of every word model, the grand variance of each
observation parameter was estimated once across
all word models and all nodes during training.
Using grand variances reduces the degradation in
performance caused by using a statistical model
that is too complex for the amount of training
data. This result reinforces past results that
demonstrate the necessity of matching the
complexity of a model to the amount of training
data [7]. Using grand variances halved the error
rate while simultaneously decreasing recognition
computation requirements. The final large
reduction in error rate (1.6% to 1.0%) was
obtained using the two-stage discriminant analysis
system described in [1]. This system focuses
attention on those parts of often confused words
that are most different and reduces the error rate
with only a slight increase in recognition
computation requirements. The final system with a
1% error rate across many stress/style conditions
is a usable, practical, robust recognizer that
could be used for a variety of speech-recognition

EFFECTS  OF  MULTI-STYLE  TRAINING 

More details on the effects of multi-style
training from the experiments described above are
presented in Figs. 2 to 4. Figure 2 compares
results with normal and multi-style training for
the six novel conditions not sampled during
training as well as for normally-spoken speech.
These are representative results for the situation
where a recognizer cannot be trained under live
stress conditions. The percentage error rate
averaged over all nine talkers is presented for
normal speech, for speech spoken slowly, for the
easy (cond50) and the more difficult (cond70)
workload task, for soft speech, for speech
produced in noise (Lombard) and for angry speech.
Multi-style training reduces the error rate
substantially for all conditions. The average
error rate over all conditions fell by more than a
factor of two from 20.7% to 9.8%. The drop in
error rate is large (to 2.9%) even for
normally spoken words and greatest for the Lombard
and angry conditions.

Figure 3 shows the results when the
recognizer was tested under the same conditions
sampled during training. Here, the average error
rate over all conditions fell by a factor of four
from 18.4% to 4.6%. It should be noted that in
these and other experiments, training word tokens
were never used during testing.

Further experiments were performed to
determine whether more effective subsets of five
styles could be found and whether fewer than five
different styles could provide large improvements.
These experiments suggest that the five styles
selected are more effective than other subsets of
the eight styles in the stress-speech data base
and that all five different styles are required
for best performance with multi-style training.
Further experiments have also been performed to
explore the effects of multi-style training with
more advanced HMM isolated-word talker-dependent
recognizers. We have found that multi-style
training always improves overall performance. For
example, the error rate for an advanced recognizer
with differential parameters, grand-variance
estimates, 14 nodes, and five training tokens,
drops from 3.2% to 1.4% with multi-style training.

One surprising result evident in Figs. 2 and
3 is that the error rate drops for normal speech
when the recognizer is trained on non-normal
training tokens. This is caused by day-to-day
variability in normal speech as demonstrated in
Fig. 4. Figure 4 presents the error rate with
normal and multi-style training for normal speech
recorded in the first, second, and third recording
sessions. As can be seen, multi-style training
and normal training produce similar results in
session one, but multi-style training is superior
in sessions two and three. These results
demonstrate that multi-style training can
compensate for variability in normal speech over
time, and that five normal training tokens
recorded in one session are less representative of
normal tokens recorded one to three weeks later
than five multi-style tokens.

DISCUSSION

Multi-style training improves performance for
the novel stress conditions because: (1) the
forward-backward training algorithm and
statistical decoding focus attention on spectral/
temporal regions that are consistent across styles
and (2) speech samples are presented during
training that are similar to those that occur
during testing. For example, loud speech is
similar in many ways to speech produced under the
Lombard condition. The improvement in performance
with conditions sampled during training was
greater than the improvement with novel untrained
conditions for this second reason.

A careful analysis of differences between
word models obtained using normal and multi-style
training and of recognizer confusions indicated
that improvements are caused by two main
mechanisms. First, estimates of the mean and
variance of the cepstral parameters used in HMM
word models are more representative of those
observed during testing with multi-style
training. This is illustrated in Fig. 5. The
left side of this figure presents the difference

ABSTRACT

This paper describes a two-stage isolated
word speech recognition system that uses a Hidden
Markov Model (HMM) recognizer in the first stage
and a discriminant analysis system in the second
stage. During recognition, when the first-stage
recognizer is unable to clearly differentiate
between acoustically similar words such as "go"
and "no" the second-stage discriminator is used.
The second-stage system focuses on those parts of
the unknown token which are most effective at
discriminating the confused words. The system was
tested on a 33 word, 10,710 token stress speech
isolated word data base created at Lincoln
Laboratory. Adding the second-stage
discriminating system produced the best results to date on
this data base, reducing the overall error rate by
more than a factor of two.

1.  INTRODUCTION 

A two-stage discriminant analysis system has
been developed to address some of the problems
generally encountered in current Hidden Markov
Model (HMM) isolated word recognition systems.
These problems include: (1) the effects of limited
training data are not explicitly taken into
account; (2) the correlation between adjacent
observation frames is incorrectly modeled; (3)
durations of acoustic events are poorly modeled;
and (4) features which might be important in
discriminating only among specific word pairs, or
sets of words, are not easily incorporated into
the system without degrading overall performance.
The two-stage system uses new statistical
techniques that explicitly account for the limited
amounts of training data available for
talker-dependent recognition. The second-stage system
focuses its attention on those parameters in the
models which are most effective in discriminating
between words which achieve similar scores in the
first-stage HMM system. A similar two-stage
system was developed by Rabiner and Wilpon [9].
That system was developed in the context of a
Dynamic Time Warping (DTW) rather than an HMM
system, and also did not explicitly take into
account the effects of limited training data.
Another approach to the focus-of-attention problem
in discrimination is presented in [6]. The new
two-stage discriminant system described here was
developed as part of a larger effort aimed at
reducing the effects of stress on robust speech
recognition systems [1,4,7,8].


2. OVERVIEW

The structure of the two-stage system is
shown in Fig. 1. Each word model in the
first-stage HMM recognizer is created using
forward-backward training and training tokens for that
word [3]. A detailed description of the HMM
recognizer which is used is given in [4,7,8]. The
recognizer uses a continuous-distribution
speaker-dependent 10-node HMM model with 16 cepstral
coefficients that are assumed to be jointly
Gaussian and independent. The HMM model is
trained using five tokens per word with
multi-style training [4,7] and variance limiting
[7,8].

The second-stage discriminant system
calculates statistics, on the cepstral parameters
and on selected additional parameters (see below),
for each word model, vocabulary word, and node by
decoding training tokens of all words using the
Viterbi algorithm with HMM word models for all
words. This "cross-word training" provides
additional statistical information which is not
available in standard HMM training, where each
word model is trained only on samples of that
word. During recognition, discriminant decisions
are based on likelihood-ratio comparisons among
all word pairs in the top N words from the HMM
system. The comparisons assume that discriminant
statistics are jointly Gaussian and independent.
In addition, a new technique called "sifting" is
applied, which uses a statistical "T" test to
focus attention only on discriminant statistics
that were judged from the training data to be
statistically different, for specific word pairs.
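
The sifting idea can be sketched as below. The exact form of the test and its significance level are not given in this overview, so the sketch assumes an unequal-variance (Welch) t statistic and a hypothetical threshold of 2.0; the function names and the layout of the statistics are our own.

```python
import math

def t_statistic(mean_a, var_a, n_a, mean_b, var_b, n_b):
    """Unequal-variance t statistic for two sample means.

    Assumes positive variances and at least one sample in each group.
    """
    return (mean_a - mean_b) / math.sqrt(var_a / n_a + var_b / n_b)

def sift(stats_a, stats_b, threshold=2.0):
    """Keep only the statistics that differ significantly between two words.

    stats_a, stats_b: dicts mapping a key such as (node, parameter) to a
                      (mean, variance, count) triple estimated in training
                      when the input word was A or B respectively.
    threshold:        hypothetical cut-off on |t|.
    """
    kept = []
    for key in stats_a:
        if abs(t_statistic(*stats_a[key], *stats_b[key])) >= threshold:
            kept.append(key)
    return kept
```

Only the keys returned by `sift` would then contribute to the likelihood-ratio comparison for that word pair, so parameters whose training estimates are statistically indistinguishable add no noise to the decision.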

The discriminant system was tested using the
Lincoln Laboratory stress-speech data base [4,8].
This includes 10,710 words from nine talkers
producing 33 acoustically-similar aircraft words
spoken normally, under workload stress, in noise
presented over earphones, and with seven different
talking styles.

3.  DISCRIMINANT  TRAINING 

Figure 2 illustrates the flow of data for the
discriminant training process. During training,
all tokens of all training words were decoded by
all of the first stage HMM word models. Each
decode resulted in a segmentation. From this
procedure a statistical description was obtained
characterizing the distribution of observations
that were assigned to each node of each word
model, given a specific input word. Each
estimated distribution was modeled as Gaussian.
This information was then stored as two
four-dimensional arrays: a mean and variance array
indexed by word model, input word, node within the
model, and parameter. Given a token segmented by
a word model, these statistics were then available
to be used, during recognition, to calculate a
likelihood-ratio between any two hypothesized
input words.
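
The accumulation of these statistics can be sketched as follows. This is an illustration only: dictionaries keyed by (word model, input word, node, parameter) stand in for the four-dimensional arrays, and the input format and function name are assumptions.

```python
from collections import defaultdict

def accumulate_statistics(decodes):
    """Mean and variance of each parameter per (model, input word, node).

    decodes: iterable of (model_word, input_word, path, frames), where
             `path` gives the node assigned to each frame when a training
             token of `input_word` is decoded by the model of `model_word`.
    Returns two dicts keyed by (model_word, input_word, node, parameter).
    """
    sums = defaultdict(float)
    squares = defaultdict(float)
    counts = defaultdict(int)
    for model_word, input_word, path, frames in decodes:
        for node, frame in zip(path, frames):
            for p, value in enumerate(frame):
                key = (model_word, input_word, node, p)
                sums[key] += value
                squares[key] += value * value
                counts[key] += 1
    mean = {k: sums[k] / counts[k] for k in counts}
    var = {k: squares[k] / counts[k] - mean[k] ** 2 for k in counts}
    return mean, var
```

Entries with `model_word != input_word` are the "cross-word" statistics: they describe how a token of one word looks when it is forced through the model of another.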

4.  DISCRIMINANT  RECOGNITION 

During recognition, unknown tokens were first
passed through the HMM system. Likelihood scores
from the Viterbi algorithm were calculated for
each word model. In cases where scores for two
models, say for words A and B, were clearly better
than all other models yet were very similar to
each other, the second stage system was used.
During the HMM pass, segmentation by each word
model assigned each input observation to a
specific node in that model. Per-node
observations were used with the discriminant training
statistics to separately calculate the
likelihood-ratio between the inputs being A and B given the
segmentation from both the A and B word models.

An effort was made to separate the scoring
based on duration information from other aspects
of the scoring. To implement this,
likelihood-ratio scores for any input token were calculated
on a per-node basis rather than on a
per-observation basis. This was achieved by first
calculating the likelihood-ratio based on all
observations in a node, then normalizing this
score by the number of observations assigned to
that node. The advantage of this scheme was
two-fold: first, it reduced the weighting of
certain nodes which might dominate in the final
score because of the large number of observations
assigned to those nodes, and secondly, it
eliminated the assumption made with
per-observation scoring that all observations are
statistically independent. On the contrary, this
"per-node" scheme assumed a very strong
correlation between observations assigned to the same
node.
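
The per-node normalization can be sketched as below, working with log likelihood ratios. This is a minimal illustration: the data layout and function names are assumptions, and the ratio is computed from diagonal Gaussians as the statistics above suggest.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of vector x under a Gaussian with diagonal covariance."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def per_node_log_ratio(frames, path, stats_a, stats_b):
    """Log likelihood ratio (A versus B) with one equal vote per node.

    frames:  observation vectors of the unknown token
    path:    node assigned to each frame by one word model's segmentation
    stats_a: dict node -> (means, variances) if the input word is A
    stats_b: dict node -> (means, variances) if the input word is B
    """
    totals = {}
    for node, frame in zip(path, frames):
        ratio = (log_gauss_diag(frame, *stats_a[node])
                 - log_gauss_diag(frame, *stats_b[node]))
        total, count = totals.get(node, (0.0, 0))
        totals[node] = (total + ratio, count + 1)
    # Divide each node's summed ratio by its number of observations.
    return sum(total / count for total, count in totals.values())
```

A node that absorbs thirty frames therefore contributes no more than a node that absorbs three, which is also why the number of frames per node (duration) no longer influences this score.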

It should be observed that this per-node
scoring technique removes duration information
from the scoring. This was desirable since it
enabled the duration information to then be
explicitly modeled and included as a separate
feature into the scoring mechanism. To facilitate
this, two more arrays were generated during the
discriminant training procedure described above.
A mean and variance array were generated modeling
the number of observations assigned to each node
by a word model given each input word. This
information was stored as two three-dimensional
arrays indexed by word model, node in the model,
and input word.


Since two likelihood-ratio scores were
calculated for each pair-wise discrimination,
corresponding to the segmentation arising from the
pair of word models, a scheme had to be devised to
account for possible disagreement from these two
scores. For initial experiments it was decided
that if discriminant scores disagreed, the
decision would simply be deferred back to the
original scores from the HMM system.

A decision also had to be made on criteria
for deciding when the second-stage should be used,
and how many of the top candidate words should be
considered. To simplify initial experiments a
hard threshold was established on the difference
between the top two HMM word scores. If the
difference between the two leading scores exceeded
the threshold, the second stage was not used. It
was later found that results were relatively
insensitive to changes in this threshold.
Discriminations were limited initially to consider
only the top two candidate words.
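
The decision logic of the initial experiments can be sketched as follows. This is an illustration under assumptions: `discriminate` is a hypothetical callback returning a log likelihood ratio (positive favors the first word) for a given segmentation, and the threshold value is left to the caller.

```python
def recognize(hmm_scores, discriminate, threshold):
    """Choose a word using the HMM scores and, if needed, the second stage.

    hmm_scores:   dict word -> first-stage log likelihood score
    discriminate: function (word_a, word_b, segmenting_word) -> log ratio,
                  positive if the evidence favors word_a
    threshold:    the second stage is skipped when the top two scores
                  differ by more than this amount
    """
    ranked = sorted(hmm_scores, key=hmm_scores.get, reverse=True)
    first, second = ranked[0], ranked[1]
    if hmm_scores[first] - hmm_scores[second] > threshold:
        return first
    # One ratio from each word model's segmentation of the token.
    ratio_from_first = discriminate(first, second, first)
    ratio_from_second = discriminate(first, second, second)
    if ratio_from_first > 0 and ratio_from_second > 0:
        return first
    if ratio_from_first < 0 and ratio_from_second < 0:
        return second
    # The two segmentations disagree: defer to the original HMM decision.
    return first
```

Only the top two candidates are considered, matching the initial restriction described above.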

5. EXPERIMENTS

The HMM system selected as the first stage
was, at the time, the best system tested on the
Lincoln database, achieving an error rate of
7.7%. A more detailed description of this system
is included in [4] and [7].

5.1 Estimated Variance

The first experiment with the two-stage
system used all the cepstral parameters, as well
as the duration and energy parameters, in the
second-stage discriminator. Performance of this
system was mediocre. The overall error rate fell
from 7.7% with the basic HMM system to 7.4% with
the two-stage system. It was suspected that part
of the reason for the disappointing performance
might be that only a subset of the parameters
contributed positively to discrimination. To
investigate the effectiveness of individual
parameters as discriminators another experiment
was performed which used only a single parameter
in all second stage discriminations. This
experiment was repeated for each available
parameter. These included the sixteen cepstral
coefficients from the HMM system, a relative
energy measurement and a node duration
measurement. Results using only the two best
parameters (duration and relative energy), each
individually, showed improvements over the
previous experiment, where all parameters were
included in the discrimination. In partial
explanation of this result, it should be noted
that when a very complex model is made for a
system and very limited training data is available
to characterize it, statistical noise from poor
estimation can degrade performance. In the above
experiment, the model was simplified to better
match the amount of training data available. Less
statistical noise was then introduced to the
scoring, and because of this, overall system
performance improved. The overall error rate
dropped from 7.4% to 7.0% using only a single
parameter (the duration parameter) in the second
stage.

models. The system also applies a statistically-
based "sifting" technique to focus its attention
on parameters which most effectively bring out
differences in the words being discriminated.

Fig. 1. Block diagram of two-stage system.

Fig. 2. Training for discriminator.

Fig. 3. Cepstral parameters used to discriminate
the word models for "go" and "oh" are indicated
with darkened regions. Most of the parameters
used are concentrated toward the beginning nodes.

ABSTRACT

This paper documents the DARPA
Resource Management (RM) Task Domain
speech data base, which is intended to be
used in the evaluation of speech
recognition systems that may incorporate a
higher-level language model. The prompts,
contributed by BB&N, were taken from a
specific sub-language model. In addition
to full sentence utterances, "spell mode"
word spellings were recorded. For use
primarily with speaker-independent
recognizers, 57 utterances were recorded from
each of 160 speakers; a speaker-dependent
set of data is provided by recordings of
1012 utterances from each of 12 speakers.
The speakers were selected, with the help
of SRI, from among the 630 speakers
recorded previously in the pan-dialectal
TIMIT Acoustic-Phonetic data base.
Recording formats and facilities were the
same, with the exception of an improvement
in suppression of background noise.


1.  INTRODUCTION  AND  BACKGROUND 

The RM task domain data base was
designed during 1986, in collaboration
with NBS, CMU, BB&N, and SRI. Both
speaker independent and speaker dependent
phases were segmented into training,
development test, and evaluation test
recordings. Digital tapes of the
recordings were shipped to the National
Bureau of Standards (NBS), for further
distribution to users, during the course
of the data base collection.

The original plan was to give the
speaker independent recordings priority,
substantially completing them before
starting on the speaker dependent
recordings. Recording of the speaker
independent part began on 10/16/86; its
training phase was completed 11/20/86 and
its development test phase was approaching
completion in early December when the
decision was made to deliver speaker
dependent data and speaker independent
data to users at about the same rates.
Speaker dependent recording began on
12/10/86 and was given priority. Speaker
independent development test data
recording was finished on 12/17/86.
Training and development test sentence
recordings for the first four of the
twelve speaker dependent recordings were
completed on 2/10/87; final completion
date of all recording was 3/25/87 for
speaker independent data and 3/26/87 for
speaker dependent.

The purpose of the task domain data
base is to provide speech data limited by
a language model. The language model,
developed at BB&N, covers utterances
appropriate for a specific naval resource
management task. TI received 2835
sentences generated from this model as a
pool from which to draw prompts. A subset
of 600 of these sentences had been hand
picked at BB&N as training sentences; in
the following explanations these may be
referred to as the PJSENT1 sentences. The
other 2235 RM sentences are the PJSENT2
sentences. Ten sentences from the same
language model were selected at SRI as
peculiarly appropriate for rapid phonetic
adaptation. At TI these sentences were
formatted to normal orthographic standards
and used as prompts. 600 words from the
vocabulary of the language model were
selected and made into prompts for
"spell-mode" readings. Subjects also
re-read the two SRI "dialect" sentences
that were used to calibrate dialect usage.

Subjects were recruited from the
sample of 630 who had given speech earlier
for the TIMIT Acoustic-Phonetic data base.
Selection was guided by an analysis of the
subjects' observed phonetic
characteristics, made at SRI. 160
subjects were used in the speaker
independent phase and 12 subjects in the
speaker dependent phase.


2.  STRUCTURE 

The macrostructure of the data base
is exhibited in the two figures below:
Figure 1 for the speaker independent phase
and Figure 2 for the speaker dependent phase.
Subjects are arrayed vertically and
utterances or sentence productions
horizontally.

Figure 1. Speaker Independent Task Domain Data Base Layout

Sentence I.D.      Description
SR001 - SR600      600 training RM sentences from PJSENTS1
ST0001 - ST2235    2235 other RM sentences from PJSENTS2
SA1 - SA2          2 SRI dialect sentences
SP001 - SP600      600 spell-mode word sentences

Table 1. Sentence I.D. Key.


In the speaker independent training
phase, each of 80 speakers read 40 RM
sentences, 2 SRI dialect sentences, and 15
spell-mode words. In all, 1600 distinct
RM sentences (sentence types as opposed to
tokens) were available in this part,
resulting in each sentence type having two
tokens, being read by two subjects. The
distribution of sentences to speakers was
arbitrary, with the exception that no
sentence was read twice by the same
subject. Both SRI dialect sentences were
read by each subject. Each speaker read
15 spell-mode words, yielding (80x15) 1200
productions, which were covered by a
selection of 300 words. Each spell-mode
word was thus read by 4 speakers.

The speaker independent development
and evaluation test sets have identical
form factors. In each, 40 speakers each
read 30 RM sentences, the 10 rapid
phonetic adaptation sentences, the 2 SRI
dialect sentences, and 15 spell-mode
words. 600 RM sentence types were
randomly selected for each test and
assigned to the 1200 available
productions, as in the training phase.
Similarly, 150 spell-mode words were
selected and assigned to the 600 available
spell-mode productions.

For speaker dependent training, each
of the 12 subjects read each of the 600
PJSENT1 RM sentences, the 2 SRI dialect
sentences, the 10 rapid phonetic
adaptation sentences, and a selection of
100 spell-mode words. The 1200 spell-mode
word readings thus produced were covered
by a selection of 300 word types,
resulting in four productions per word.

In the speaker dependent development
and evaluation test sets, each of the same
12 speakers read a selection of 100 RM
sentences and 50 spell-mode words. Two
random selections of 600 RM sentences were
made from the PJSENT2 sentences, one for
the development test and one for the
evaluation test. Distributing these over
the (12x100) 1200 productions available in
each gives 2 utterances per sentence.
Similarly, two random selections of 150
words each were made from the pool of 600
spell-mode words, for development and
evaluation tests. Distributing these over
(12x50) 600 readings available yields 4
subject productions for each word.
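
The production counts quoted in this section can be checked with a few lines of arithmetic. The sketch below simply recomputes them from the speaker, item, and type counts given above; the function and dictionary names are our own.

```python
def productions_per_type(speakers, items_per_speaker, distinct_types):
    """Total productions and average number of productions per type."""
    total = speakers * items_per_speaker
    return total, total / distinct_types

# (speakers, items read per speaker, distinct sentence or word types)
layout = {
    "SI training, RM sentences": (80, 40, 1600),
    "SI training, spell-mode words": (80, 15, 300),
    "SI test, RM sentences": (40, 30, 600),
    "SI test, spell-mode words": (40, 15, 150),
    "SD training, spell-mode words": (12, 100, 300),
    "SD test, RM sentences": (12, 100, 600),
    "SD test, spell-mode words": (12, 50, 150),
}

for name, counts in layout.items():
    total, per_type = productions_per_type(*counts)
    print(f"{name}: {total} productions, {per_type:g} per type")

# Utterances per speaker independent subject (training and test phases).
print(40 + 2 + 15, 30 + 10 + 2 + 15)
# Utterances per speaker dependent subject: training plus two test sets.
print((600 + 2 + 10 + 100) + 2 * (100 + 50))
```

The last two lines reproduce the per-speaker totals stated in the abstract: 57 utterances for each speaker independent subject and 1012 for each speaker dependent subject.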


3.  SENTENCE  IDENTIFICATION 

Each sentence that was read has an
identifying name. This sentence i.d.
appears as a sub-field in the name of
speech files holding recordings of the
associated sentence. Table 1 is a key
showing the significance of the different
sentence i.d.'s.


4. LEXICON


In  order  to  have  a  uniform  and 
repeatable  scoring,  there  is  a  need  to 
specify  each  of  the  RM  sentences  in  terms 
of  a  string  of  recognition  units  from  a 
standard  lexicon.  It  seems  best  to  derive 
these  representations  from  the  prompts 
actually  used  in  the  collection  of  the 
data  base  instead  of  some  other  phase  of 
the  language  model,  since  they  are  the 
most  sure  representation  of  what  was 
probably  said. 

Dave  Pallett  (of  NBS) ,  who  is 
organizing  the  scoring  procedures,  after 
soliciting  and  considering  the  opinions  of 
interested  parties,  issued  a  memo  giving 
rules for converting our prompts from
normal  orthography  into  strings  of  these 
lexical units. We call such
representations SNOR's, for Standard
Normalized Orthographic Representations.
In  this  kind  of  representation,  the 
lexical  units  (or  "words")  are  strings  of 
non-blank  characters  separated  by  a  blank. 
We wrote a set of lexicalizing rules in a
quasi-linguistic format, implementing
Dave's rules but making explicit choices
where  there  was  some  vagueness  in  his 
formulation.  These  rules  are  presented 
below  as  Figure  3. 

The  format  of  the  rules  is 
straightforward.  In  the  symbol-defining 
section,  certain  variables  are  defined 
that  range  over  specified  strings  of 
characters,  and  are  used  in  the  later 
definition  of  rules.  In  the  rule-defining 
section,  a  list  of  rules  for  transforming 
character  strings  is  given,  of  the  form: 

[A] --> [B] / [C] _ [D] ; "comments"

The  algorithm  for  rule  application  is 
simple. The rules apply to map an input
buffer  of  characters  into  an  output  buffer 
of  characters;  the  input,  left 
environment,  and  right  environment  fields 
of  each  rule  match  to  the  input  buffer, 
and  if  the  rule  applies,  the  output  field 
of  the  rule  is  added  to  the  output  buffer. 
A  cursor  is  initiated  to  point  to  the 
first  character  in  the  input  buffer;  at 
each  cycle,  the  list  of  rules  is  searched 
from  top  down  until  either  a  rule  is  found 
that applies or the end of the list of
rules  is  reached.  If  a  rule  is  found  that 
applies,  the  output  of  the  rule  is  added 
to  the  output  buffer  and  the  input  buffer 
cursor  is  advanced  beyond  the  part  of  the 
input  buffer  that  was  matched  by  the  input 
field  of  the  rule.  If  no  rule  applies, 
the  single  character  that  the  input  buffer 
cursor  points  to  is  copied  into  the  output 
buffer  and  the  input  buffer  cursor  is 
advanced  by  one. 
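
The cursor-driven procedure just described can be sketched in C as follows. This is a minimal illustration, not the TI program: the names `struct rule`, `rule_applies` and `apply_rules` are hypothetical, environments are matched as literal strings only, and the `#` word-boundary symbol and `$`-variables of the real rule files are not modelled.

```c
#include <stddef.h>
#include <string.h>

/* One rewrite rule: input => output / left _ right.
   An empty left or right string means "no constraint". */
struct rule {
    const char *input;
    const char *output;
    const char *left;
    const char *right;
};

/* Returns 1 if the rule matches the input buffer at position pos. */
static int rule_applies(const struct rule *r, const char *in, size_t pos)
{
    size_t n_in = strlen(r->input);
    size_t n_left = strlen(r->left);

    if (n_in == 0 || strncmp(in + pos, r->input, n_in) != 0)
        return 0;
    if (n_left > pos || strncmp(in + pos - n_left, r->left, n_left) != 0)
        return 0;
    return strncmp(in + pos + n_in, r->right, strlen(r->right)) == 0;
}

/* Maps `in` to `out` (capacity `cap`, including the terminator).
   Returns 0 on success, -1 if `out` is too small. */
int apply_rules(const struct rule *rules, size_t n_rules,
                const char *in, char *out, size_t cap)
{
    size_t pos = 0, used = 0, len = strlen(in);

    if (cap == 0)
        return -1;
    while (pos < len) {
        const char *emit = NULL;
        size_t emit_len = 1, advance = 1, i;

        for (i = 0; i < n_rules; i++) {      /* search the list top down */
            if (rule_applies(&rules[i], in, pos)) {
                emit = rules[i].output;
                emit_len = strlen(emit);
                advance = strlen(rules[i].input);
                break;
            }
        }
        if (emit == NULL)                    /* no rule applies: copy one character */
            emit = in + pos;
        if (used + emit_len + 1 > cap)
            return -1;
        memcpy(out + used, emit, emit_len);
        used += emit_len;
        pos += advance;                      /* skip past the matched input field */
    }
    out[used] = '\0';
    return 0;
}
```

With the three rules `' => +`, `. => (nothing)` and `blank => - / NEW _ YORK`, the input `NEW YORK'S.` is mapped to `NEW-YORK+S`. Because the first matching rule wins, rule order matters, which is why the hyphenated-idiom rules in a real rule file would need to precede any general rule that consumes blanks.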

C FILE TTPSNOR2.RLS

C  SET  OF  RULES  TO  CONVERT  T.I.  PROMPT  SENTENCES  INTO 
C  STANDARD  NORMALIZED  ORTHOGRAPHIC  REPRESENTATION  (SNOR)  FORMAT. 

* INPUT : PCODEFILE=UD:[SPEECH.PH]CPASCII.DAT

* OUTPUT: PCODEFILE=UD:[SPEECH.PH]CPASCII.DAT
C  RULE  FAILURE  ACTION  =  'PASS' 

*  *****  SYMBOL  SECTION  BEGINS  HERE  ***** 

>  i i f >  »/»?»/»! */’; 

-CAN+>  = 

$ANUM = 'T'/'N'/'R'/'S'/'H'/'D'/'L'/'F'/'C'/'M'/'G'/'P'/'W'/
        'B'/'V'/'K'/'X'/'J'/'Q'/'Z'/'E'/'A'/'O'/'I'/'U'/'Y'/
        '1'/'2'/'3'/'4'/'5'/'6'/'7'/'8'/'9'/'0'/' ZERO '

*  *****  RULE  SECTION  BEGINS  HERE  ***** 

*  RULE  FORMAT  B 

[S'] => [S+S] / [#] ; weird possessive plural formation rule

['] => [+] ; "apostrophes become pluses" (for exception see TTPSNOR1.RLS)

[.] => [] ; "abbreviations become single words, no end-of-sentence punctuation"

i

[?] => [] ; "no end-of-sentence punctuation"

[  ] => [ ] ; delete multiple blanks

[  ] => [ ] ; delete multiple blanks


special hyphenated idioms:

[ ] => [-] / [#DIEGO] _ [GARCIA]

[ ] => [-] / [#HONG] _ [KONG]

[ ] => [-] / [#ICE] _ [NINE]

[ ] => [-] / [#LAT] _ [LONG]

[ ] => [-] / [#NEW] _ [YORK]

[ ] => [-] / [#NEW] _ [ZEALAND]

[ ] => [-] / [#PAC] _ [ALERT]

[ ] => [-] / [#SAN] _ [DIEGO]

[ ] => [-] / [#SAN] _ [FRAN]

C correct spelling of "0" in some alphanumeric strings:

[ZERO]  =>  [0]  /  [-]  _ 

C  supply  weird  spellings  for  some  acronyms 
[CROVEL]  =>  [CROVL]  /  [#] 

[PACK] => [PAC] / [#] _

[TACKIN]  =>  [TACAN]  /  [#] 

[TASSEM]  =>  [TASM]  /  [#] 

[-FLEET]  =>  [FLT]  /  [LANT] 

[-FLEET]  =>  [FLT]  /  [PACK] 

C alphanumeric strings spelled without hyphens:

[-]  =>  []  /  KAN+>$ANUM]  [$ANUM-CAN+>] 

[-]  =>  []  /  [.]  [$ANUMTXN+>]  ;  D.D.D. -2-4-3" 

C NOTE: THE RIGHT ENVIRONMENT IS NEEDED IN THE ABOVE RULE TO PREVENT
C IT FROM APPLYING TO, FOR INSTANCE, "S.Q.Q.-23", WHICH SHOULD BE "SQQ-23",
C NOT "SQQ23".


Figure  3.  Major  Rules  for  Lexicalizing  Prompts. 


The  rules  shown  in  Figure  3  are 
preceded  by  passes  of  rules,  not  shown, 
which  capitalize  all  letters,  delete  the 
apostrophe in such abbreviated dates as
"'87", and convert numeral strings into
English words.

In brief, the rules: 1. eliminate
punctuation; 2. convert letters to all
capitals; 3. replace apostrophe with "+";
4. combine certain "words" into single
lexemes, using hyphens; 5. split up
certain alphanumeric strings by inserting
blanks.


TI  has  run  a  program  to  apply  these 
rules  to  all  prompts  used  in  this  data 
base  and  can  supply  the  resulting  SNOR 
strings  and  an  alphabetical  listing  of  the 
SNOR  lexicon  of  these  sentences  to  any 
interested  parties. 

Recording conditions were nearly the
same  as  has  been  reported  at  earlier  DARPA 
workshops;  in  brief:  subjects  were  seated 
in  a  sound-isolated  recording  booth;  the 
director  placed  a  Sennheiser  SN  414 
headset  microphone  on  the  subject  and, 
using  a  template,  positioned  a  B&K 
pressure  microphone  about  30  centimeters 
away  from  the  subject’s  mouth,  20  degrees 
to  the  left;  the  subject  was  instructed  to 
read  the  prompts  appearing  on  a  CRT  screen 
in  a  "natural"  voice;  and  speech  was 
digitized  directly  onto  disk  at  20  ksps 
per channel. The automatic recording
software  system  STEROIDS  was  used.  Each 
recording  was  listened  to  by  both  the 
recording  director  and  the  subject  to 
check  for  errors. 

Before  this  data  base  recording 
began,  the  sound  booth  was  retrofitted 
with  a  steel  I-beam  subfloor  and  air 
spring  suspension  system  which  reduced 
low-frequency (<100 Hz) noise by about
20-25  dB. 

The  raw  recordings  were  split  into 
separate  files  for  each  channel,  filtered, 
and  down-sampled  to  16  ksps  as  before  and 
these  versions  of  the  speech  were  sent 
along  with  the  original  20  ksps  2-channel 
files  to  NBS. 


After  a  large  amount  of  speech  had 
been  sent  to  NBS,  they  discovered  that 
some  recordings  apparently  had  final  words 
clipped  off;  this  problem  was  called  the 
"zero-tail"  problem.  Some  investigation 
determined  that  the  original  recording  was 
all  right  and  that  the  problem  was  caused 
by  a  bug  in  general  speech  file  software 
that  was  introduced  with  a  program  change 
in  October.  It  was  a  "magic  number" 
problem;  only  about  2%  of  the  recordings 
made  with  the  buggy  software  were 
affected.  Recordings  that  had  not  been 
yet  shipped  by  the  time  the  bug  was  fixed 
were  corrected  before  shipping.  The 
recordings  that  had  already  been  shipped 
were  handled  in  a  different  manner:  two 
"errata"  tapes  have  been  prepared,  which 
contain  corrected  versions  of 
already-shipped  files  that  were  found  to 
have  zero  tails.  These  errata  tapes  are 
delivered  with  the  data  base,  clearly 
marked,  and  users  should  make  sure  that 
the  files  on  the  errata  tapes  are  used  in 
place  of  the  corresponding  files  with 
matching names.

Introduction

To  achieve  high  performance  continuous  speech 
recognition,  we  need  to  bring  to  bear  a  wide  variety  of 
sources  of  knowledge.  To  achieve  real-time  continuous 
speech recognition, we must implement these knowledge
sources  working  cooperatively  in  a  very  high  speed 
computing  environment,  probably  with  many  processors 
running  in  parallel.  This  paper  discusses  one  approach  to 
achieving  these  goals  at  a  reasonable  cost  in  the 
environment  of  a  single  workstation.  It  is  furthermore  a 
prime  goal  of  Dragon  Systems'  project  to  provide  an 
environment  in  which  knowledge  sources  of  many 
different  types  may  be  implemented. 

The  architecture,  as  seen  by  the  knowledge  source 
software,  should  be  capable  of  mixing  stochastic  knowledge 
sources  with  deterministic  knowledge  sources.  It  should  be 
capable  of  combining  rule-based  knowledge  representations 
with  pattern-matching  based  knowledge.  It  should  provide 
for both parametric and non-parametric statistical
procedures.  Finally,  it  should  facilitate  the  independent 
development  of  separate  knowledge  sources,  possibly  at 
remote  sites. 

Of course, Dragon Systems is not done in working
towards these overall goals. It is not claimed that we are
anywhere close to a complete solution to the problem of
many different knowledge sources cooperating in a real-time
environment. Rather, the opposite is more nearly
true: research on different knowledge sources cooperating
in a real-time environment is likely to be of value
specifically because of the current fragmentary state of our
knowledge. Getting different kinds of knowledge sources
to work together is still very much an exploratory activity.
In this light, the current project is not trying to find an
optimal multi-processor, multi-knowledge source
architecture, but merely an adequate one.

An  important  objective  of  this  project  is  to  develop 
techniques  which  apply  not  only  to  a  substantial  variety  of 
the algorithms that we know today, but also to the
algorithms  that  we  might  invent  in  the  future.  Therefore, 
both  the  hardware  architecture  and  the  software 
architecture  must  be  general  purpose,  not  tailored  to  a 
particular  class  of  algorithm. 


Hardware  Architecture 

The software architecture, discussed in more detail
below, is specifically designed to be compatible with many
of the existing parallel processor architectures. To meet
the  goal  of  fitting  in  a  single  workstation  at  moderate  cost, 
however,  the  architecture  shown  in  figure  1  has  been 
chosen.  The  general  framework  is  a  simple  tree:  the  host 
processor  in  the  workstation  itself  with  several  clusters  of 
processors,  with  each  cluster  implemented  on  one  or  two 
boards  that  plug  into  the  peripheral  bus  of  the  workstation. 
Each  cluster  has  a  local  system  bus  with  a  memory  that  is 
shared  by  the  host  and  all  the  processors  in  the  cluster. 
Each  processor  in  the  cluster  also  has  a  substantial  amount 
of local memory of its own. The "knowledge" of the
individual  knowledge  sources  is  stored  in  these  local 
memories. 

This architecture is not intended as a great
innovation; similar architectures have been built before.
Rather, it is a simple and reliable means of fitting a large
amount of general purpose computation in a small space.
With  the  multi-processor  board  that  Dragon  has  designed, 
it  is  possible  to  fit  up  to  7  general  purpose  2  MIP 
processors  in  a  single  slot  of  a  personal  computer.  Over  50 
MIPS  could  be  available  in  a  workstation.  Even  more 
computational  power  is  feasible  with  more  specialized 
processors. 

In any multi-processor, multi-knowledge-source
architecture  a  prime  consideration  is  the  communication 
between  the  knowledge  sources  and  the  communication 
between  the  processors.  As  will  be  discussed  later,  the 
software  architecture  that  Dragon  has  adopted  provides  for 
a  flexible,  but  very  structured  and  controlled 
communication  between  the  knowledge  sources.  The 
principal  strategy  which  is  used  to  reduce  the  amount  of 
communication  between  processors,  is  to  have  each 
processor  have  sufficient  processing  power  and  a  sufficient 
amount  of  local  memory  to  implement  one  or  more 
complete  knowledge  sources. 

Choosing an architecture in which knowledge sources
and processors are somewhat loosely coupled leads to
different issues and different research questions than a
more tightly coupled architecture. Thus code vectorization
and converting serial code to parallel code, which might be
critical in a tightly coupled architecture, are insignificant
in this architecture. On the other hand, partitioning the
knowledge into separate local memories, which is
unnecessary in an architecture in which every processor
has immediate or near-immediate access to every memory
location, is a critical issue in this hierarchical architecture.
However, since it is a prime concern of this research
project to study how knowledge sources of different types
can work together, it is entirely appropriate to choose a
hardware architecture in which the most important
implementation issues occur at a similar level to the
important issues in the functional architecture.


Note  also  that  with  each  knowledge  source  located  on 
a single general purpose processor, it is practical to do
much of the development work for a knowledge source
independently in a stand-alone environment, and still
easily  link  the  knowledge  source  into  the  rest  of  the 
system. 

The  specific  implementation  that  Dragon  Systems  has 
designed uses up to seven 80286 processors on a mother-daughter
board combination in a single slot in a high-end
MS-DOS personal computer. There is 876K of local
memory for each processor and also 64K of memory
shared by all processors on the board and also by the host
processor in the personal computer. The local system bus
is essentially a standard Multibus restricted to the local
board. Dragon's multi-processor board and its interface to
the host CPU are described in much greater detail in the
separate document "Multi-Processor Board."

Code  written  for  this  design  should  be  upward 
portable  to  a  design  using  80386  processors  with  no 
recoding at all. It should be portable to workstations using
other  peripheral  buses  (Multibus,  VME-bus  or  Unibus)  and 
other  operating  systems  (UNIX  or  VMS)  with  only 
moderate  redesign  and  recoding.  All  of  the  knowledge 
source  code,  in  particular,  runs  independently  (within  the 
specifications  of  the  software  architecture)  on  a  single 
processor.  An  individual  knowledge  source  is  implemented 
independently of the higher-level hardware structure.

Since MS-DOS is not multi-tasking and not re-entrant,
we have implemented a multi-tasking monitor to
handle the low-level communication and synchronization
between  the  processors  and  to  simulate  multi-processors  on 
a  single  processor.  Detailed  specifications  for  these 
routines  are  available,  but  they  will  not  be  discussed 
further  in  this  paper. 

It also should be pointed out that although the multi-processor
design has been completed, only a one-processor
prototype  has  been  constructed  so  far.  Also,  the 
benchmark  software  development  has  been  done  in  a 
personal  computer  environment  with  limited  memory,  so 
even  though  the  software  architecture  is  intended  for  a 
multi-processor,  multi-tasking  environment,  the  current 
implementation  runs  non-real-time  on  a  single  processor 
without  simulating  the  low-level  communication  details. 


Software  Architecture 

The overall architecture of the multi-knowledge
source system requires a clear distinction between the
concepts of "knowledge" and of "data." "Knowledge"
should be thought of as permanent information, such as
properties of speech or facts of linguistics. "Data" is the
information that has been computed about a particular
utterance. "Data" is passed around among the knowledge
sources and is used in the recognition process, but unless it
is converted to "knowledge," it is not permanently stored.
"Knowledge," on the other hand, is not shared. All
knowledge is local to a particular knowledge source.

It is important to notice that these definitions are not
merely  definitions  to  distinguish  the  two  kinds  of 
information.  Splitting  all  information  into  these  two 
categories  deliberately  imposes  very  significant  limitations 
on  the  overall  system.  "Knowledge"  cannot  be  shared 
among  knowledge  sources;  "data"  cannot  be  saved 
permanently. Although some exceptions are allowed for
efficiency,  the  distinction  between  "knowledge"  and  "data" 
is  deliberately  enforced  to  enhance  the  modularity  of  the 
knowledge  sources. 


For example, if two different knowledge sources
both need models for the "expected" formant frequencies
of each steady-state vowel (say one that is recognizing
consonants from the formant transitions in adjacent
vowels,  and  one  that  is  recognizing  the  vowels  themselves), 
then  they  should  each  have  their  own  local  copy  of  that 
knowledge.  They  can  share  the  "data"  about  the  formant 
frequencies  estimated  during  a  particular  utterance,  but 
they  must  each  have  their  own  local  copy  of  the  permanent 
"knowledge."  If  they  each  have  their  own  local  models, 
each  knowledge  source  would  still  work  if  the  other 
knowledge  source  were  replaced  by  another  knowledge 
source  that  used  a  completely  different  modeling  method. 

A  knowledge  source  cannot  directly  call  a  function  in 
another  knowledge  source.  All  communication  is  done  by 
posting  data  on  the  "bulletin  board."  The  system  enforces 
a very strong degree of modularity on the knowledge
sources. 

In these restrictions, however, training is logically
separated from recognition. Training, if necessary, can
run offline using data that has been temporarily saved in
files.  Related  knowledge  sources  that  might  be  executing 
on  separate  processors  at  recognition  time  can  be  put  on  a 
single  processor  and  can  make  direct  calls  to  functions  in 
other  knowledge  source  modules.  This  mechanism  should 
be  used  sparingly  and  not  abused,  but  it  is  open-ended 
enough  to  allow  any  training  algorithm  implementation  that 
is consistent with good structured program practice.
Training  does  not  have  to  follow  the  stricter  discipline  that 
is  necessary  for  real-time  computation  on  parallel 
processors. 

How  to  do  global  training  in  a  system  that  has  many 
different  kinds  of  knowledge  sources  is  a  very  complex 
and  intriguing  question.  In  particular,  it  is  an  open 
research  question  as  to  how  to  combine  knowledge  sources 
that  use  fully  automatic  training  with  knowledge  sources  in 
which  the  training  process  involves  interaction  with  a 
human expert. However, our investigations are still at a
very preliminary stage. So, even though discussion of this
issue  would  be  very  welcome,  it  is  not  covered  in  this 
paper. 

The discussion will now focus, therefore, on the
loading of knowledge and the communication of data at
recognition time. Five functions are specified for the
communication of knowledge and data: three entry points
that the knowledge source provides to the system (ks_load,
ks_call, ks_unload, where "ks" would be replaced by a
unique character string identifying the particular
knowledge source) and two system functions
(load_get_knowledge and post_get_data) that the
knowledge source calls to get the actual knowledge or data.

The parameters and calling specifications for these
functions are given in the Appendix.

From the point of view of an individual knowledge
source, activity is divided into three parts: 1) loading and
initializing the knowledge source, 2) the actual processing
of utterances, and 3) cleaning up, freeing memory and
unloading. Loading and unloading are mainly used in
environments in which not all of the knowledge will fit in
available memory. In the multi-processor configuration,
with a sufficient number of processors, all knowledge
sources will be loaded at system initialization and would
not need to be unloaded. The discussion, therefore, will
focus  on  ks_call  and  the  processing  of  utterances  to  be 
recognized. 

When a knowledge source is called, the only input
parameter is "time." The parameter "p_post_list" is a
pointer that the knowledge source sets to point to the first
item in a list of items to be posted on the bulletin board,
and  "p_done_time"  is  an  output  value  that  the  knowledge 
source  sets  to  tell  the  system  that  it  is  finished  up  through 
the  indicated  time.  The  central  data  structure  is  called  a 
"bulletin  board"  rather  than  a  "blackboard,"  because 
previously  posted  items  cannot  be  modified. 
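
The difference between a bulletin board and a blackboard can be made concrete with a small C sketch. It is an illustration only, not Dragon's implementation: `struct bulletin_board`, `board_post`, `board_get` and the fixed capacities are assumptions introduced here. The point is that the only write operation appends, and readers receive a `const` pointer.

```c
#include <stddef.h>
#include <string.h>

#define BOARD_MAX_ITEMS 64   /* hypothetical capacity */
#define ITEM_MAX_BYTES  32   /* hypothetical per-item limit */

struct board_item {
    unsigned short post_time;        /* time slot of the posting */
    unsigned short size;             /* bytes used in data[] */
    char data[ITEM_MAX_BYTES];
};

struct bulletin_board {
    struct board_item items[BOARD_MAX_ITEMS];
    size_t count;
};

void board_init(struct bulletin_board *b)
{
    b->count = 0;
}

/* Appends a copy of the data. Returns 0 on success, -1 if it does not fit.
   There is deliberately no function that edits or deletes a posted item. */
int board_post(struct bulletin_board *b, unsigned short post_time,
               const void *data, unsigned short size)
{
    struct board_item *it;

    if (b->count >= BOARD_MAX_ITEMS || size > ITEM_MAX_BYTES)
        return -1;
    it = &b->items[b->count++];
    it->post_time = post_time;
    it->size = size;
    memcpy(it->data, data, size);
    return 0;
}

/* Read-only access to a posted item, or NULL if index is out of range. */
const struct board_item *board_get(const struct bulletin_board *b, size_t index)
{
    return index < b->count ? &b->items[index] : NULL;
}
```

A knowledge source that wants to revise an earlier hypothesis would, under this discipline, post a new item rather than overwrite the old one.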

The  knowledge  source  gets  its  actual  input  data  by 
calling  the  function  "post_get_data."  This  "input  data  on 
demand,"  lets  an  individual  knowledge  source  request  only 
the  data  that  it  needs  rather  than  all  the  data  the  system 
has  available,  reducing  the  inter-processor  communication. 
The  system  keeps  track  of  the  data  sources  that  provide 
input  to  a  given  knowledge  source.  The  system  does  not 
issue the "ks_call" for a particular value of "time" until all
the  input  data  is  ready.  Thus  the  knowledge  source  knows 
that  it  can  call  "post_get_data"  for  any  of  its  input  data 
sources  for  any  time  up  to  and  including  the  current  value 
of  "time." 
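
The readiness rule described above amounts to a simple check on the system side. The sketch below assumes the system keeps, for each knowledge source, an array of the "p_done_time" values most recently reported by its input sources; the function name `may_issue_call` and that bookkeeping are hypothetical, not taken from Dragon's monitor.

```c
#include <stddef.h>

/* Returns 1 if ks_call may be issued for `time`: every input source must
   have reported that it is done posting up through `time`.
   A knowledge source with no input sources (n == 0) is always callable. */
int may_issue_call(const unsigned short *done_times, size_t n,
                   unsigned short time)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (done_times[i] < time)
            return 0;   /* this input source is still behind */
    }
    return 1;
}
```

With input sources done through times 10, 7 and 12, a call for time 7 may be issued but a call for time 8 may not: the slowest input source sets the pace.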

It  is  immediately  apparent  that  this  manner  of  calling 
the  knowledge  sources  imposes  a  timewise  "left-to-right" 
order  on  the  processing  of  each  utterance.  This  is  one  of 
the  compromises  that  has  been  made  to  keep  the  system  as 
simple  as  possible  in  some  ways  in  order  to  make  it  as 
flexible  and  general  as  possible  in  others.  Note  that,  since 
"p_done_time"  may  lag  behind  "time,"  each  knowledge 
source  may  internally  create  a  look-ahead  buffer  of 
arbitrary  duration.  Although  there  are  some  potential 
knowledge  sources  which  might  be  very  inefficient  given 
this  constraint  of  timewise  processing,  it  is  a  reasonable 
constraint  for  a  real-time  system.  The  main  limitation 
imposed  by  this  processing  method  is  in  the  possible 
implementations  of  syntax  control,  semantics,  and  language 
modeling  generally,  since  any  context  dependence  of 
duration less than a couple of words can be easily handled
with a look-ahead buffer. There are at least some parsing
and language modeling methods that can work well within
this constraint, with a limited look-ahead.

For  the  input  data,  a  strict  "timewise"  sequence  is 
imposed  on  a  knowledge  source.  Once  a  knowledge  source 
has  called  "post_get_data"  for  a  particular  time,  it  cannot 
go  back  to  any  earlier  time.  If  it  wants  to  reuse  an  item, 
it must buffer it internally. The output is not as
restricted: the knowledge source can post data for any
"post_time" greater than any previous "p_done_time."
The  system  is  responsible  for  buffering  the  output  of  the 
knowledge  source  until  other  knowledge  sources  have  used 
it. 
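
As an illustration of internal buffering with "p_done_time" lagging behind "time", the following C sketch shows a toy knowledge source that needs one frame of look-ahead: it decides whether the input at time-1 is a local peak only after seeing the input at the current time. The names `struct peak_picker` and `peak_picker_call` are hypothetical, and for brevity the input value is passed in directly instead of being fetched with post_get_data.

```c
/* State kept between calls: the two most recent inputs. */
struct peak_picker {
    int older;       /* input at time-2 */
    int newer;       /* input at time-1 */
    unsigned seen;   /* inputs buffered so far, capped at 2 */
};

void peak_picker_init(struct peak_picker *p)
{
    p->older = 0;
    p->newer = 0;
    p->seen = 0;
}

/* Feeds the input for `time`. If a decision for time-1 can be made, stores
   it in *is_peak, sets *p_done_time to time-1 and returns 1. Otherwise
   returns 0 and leaves both outputs untouched. */
int peak_picker_call(struct peak_picker *p, unsigned short time, int input,
                     int *is_peak, unsigned short *p_done_time)
{
    int produced = 0;

    if (p->seen == 2) {
        *is_peak = (p->newer > p->older) && (p->newer > input);
        *p_done_time = (unsigned short)(time - 1);   /* lags by one frame */
        produced = 1;
    }
    p->older = p->newer;   /* inputs cannot be re-read, so keep our own copies */
    p->newer = input;
    if (p->seen < 2)
        p->seen++;
    return produced;
}
```

Fed the inputs 1, 5, 2 at times 0, 1, 2, the sketch reports nothing for the first two calls and on the third call reports a peak with the done time set to 1. A longer look-ahead would replace the two saved values with a ring buffer of the required depth.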

The  constraints  of  this  functional  architecture  thus 
are  as  follows: 

1) Knowledge is local, not shared

2) Data is temporary, not saved

3) Utterances are processed timewise

Implicit constraints include:

4) Each knowledge source should use only a small
fraction of its computation time communicating data

5) Each knowledge source should operate in real-time
on a single 2 MIP processor with 876K of local memory

6) To minimize response time, knowledge sources
should be designed to use data as soon as possible after
it becomes available.


The "fraction" in constraint (4) must be smaller (in
the range 5-10%) for this architecture than for some
architectures, to prevent the local system bus from
becoming a bottleneck. However, communication between
several small knowledge sources clustered on a single
processor doesn't count against this constraint. Since
Dragon Systems has demonstrated real-time large
vocabulary, natural language isolated word recognition on a
single 1 MIP processor, it is believed that constraint (5) is
not too great a limitation on the complexity of an
individual knowledge source.

A knowledge source that requires a lot of processing
but that doesn't use too much memory can easily be
partitioned to run on more than one processor simply by
making duplicate copies of the local "knowledge." The
greatest design constraint is for knowledge sources that
need more than 876K of local memory for program code
plus their "knowledge." Such knowledge sources must be
partitioned to run as separate knowledge sources on more
than one processor, without the ability to share knowledge
among the partitioned knowledge sources.

It  is  easy  to  see  that  the  constraints  are  all  very  broad 
and  not  specific  to  the  kinds  of  knowledge  sources 
involved.  In  this  framework  a  "knowledge  source"  is  any 
module  of  subroutines  that  satisfy  the  necessary  constraints 
to  be  local  to  a  processor.  The  module  need  not  deal  with 
what  would  conventionally  be  called  "knowledge."  Thus  an 
FFT routine would be a "knowledge source" as long as it
followed  the  calling  conventions  and  received  and  sent  all 
its  data  by  posting  on  the  bulletin  board.  The  FFT  routine 
would  have  an  empty  set  of  "knowledge,"  even  in  the 
formal  sense,  unless  the  coefficient  table  was  loaded  as 
"knowledge."  Thus  a  "knowledge"  source  need  not  actually 
have  any  "knowledge." 

On  the  other  hand,  a  knowledge  source  could  be  very 
complex.  It  could  be  a  complete  rule-based  phoneme 
recognizer  or  a  complete  hidden  Markov  model  word 
recognizer.  Some  knowledge  sources  could  be  "translators" 
that  would  allow  knowledge  sources  of  different  types  to 
cooperate  with  each  other.  With  this  functional 
architecture  it  will  be  possible  to  run  "plug-and-replace" 
experiments  with  several  different  versions  of  the 
knowledge source that does a particular task. The other
knowledge sources will not need to be explicitly aware of
which  of  the  experimental  knowledge  sources  is  in  the 
system. 

As  a  matter  of  good  programming  practice,  it  is 
generally  preferable  to  break  the  recognition  task  up  into  a 
larger  number  of  simpler  knowledge  sources.  Thus 
different  knowledge  sources  might  specialize  on  different 
phoneme  classes  rather  than  being  combined  into  a  single 
knowledge  source.  Each  knowledge  source  should  be  only 
a  few  hundred  lines  of  code  in  a  higher  level  language. 

Current  work  is  proceeding  on  two  fronts:  within  this 
software  architecture  Dragon  is  implementing  a  complete 
connected word recognizer as a feasibility proof of the
knowledge  source  partitioning  and  the  timewise  processing 
and  as  a  platform  for  studying  communication  and 
performance bottlenecks. Dragon is also implementing
"benchmark" versions of a variety of novel algorithms, e.g.
neural  networks,  to  see  how  they  might  be  incorporated 
into  this  framework.  Dragon  also  invites  other  DARPA 
sites  to  submit  knowledge  sources  in  either  source  code  or 
object code form. The greater the variety of knowledge
sources that can work together in a cooperative
environment, the greater will be the benefit for the whole
speech recognition research community.

Appendix

BOOL ks_load(ks,data_size,data_handle)

/* Returns YES if already loaded */
uns16 ks; /* The unique handle the system has assigned to
this knowledge source, used when calling
load_get_knowledge. */

int32 data_size; /* The size in bytes of the block of
knowledge that the system has available to
pass to this knowledge source */
int16 data_handle; /* A handle that the system has assigned
that the knowledge source should use when calling
load_get_knowledge */

BOOL ks_call(time,p_post,p_done_time)

/* Returns YES for end-of-data condition similar to EOF */

uns16 time; /* Current time as measured by the system. Being
called with this value of "time" tells this
knowledge source that any input data that it
needs is ready:
all knowledge sources that send input data to
this knowledge source have reported to the
system that they are done posting up through this
value of time. */

struct POST_LIST **p_post; /* A pointer to a list of items
to be posted on the "bulletin board" */
uns16 *p_done_time; /* The knowledge source tells the system
that it has finished posting all items to be
posted at time up to and including p_done_time.
*/

struct POST_LIST {

uns16 size; /* actual size of the data in this item */
uns16 post_time; /* the time slot on the bulletin board
at which this item is to be posted */
struct POST_LIST *next_post; /* pointer to the next item
to be posted, if any */
char data[MAX_DATA_SIZE];

/* The actual data, which may
have any kind of information, but whose internal
structure is only known to the knowledge sources that
use it. */

};


BOOL  ks_unload() 

int32 load_get_knowledge(ks,data_size,data_area,data_handle)

/* Returns the actual number of bytes sent */
uns16 ks; /* Knowledge source handle */
int32 data_size; /* Size of buffer area in which to
put a portion of the knowledge. */
char *data_area; /* Pointer to buffer area */

int16 data_handle; /* A handle for the knowledge data_area,
comparable to a file pointer. */

/* (In the current implementation load_get_knowledge is
functionally similar to a
read(data_handle,data_area,data_size) ) */

BOOL post_get_data(handle, post_time, ship_to, ship_size, remain,
complete, eof)

/* Returns YES if there is more to come for the current
post_time */

uns16 handle; /* Handle identifying the data source */
uns16 post_time; /* The posting time for which data is
requested. Any knowledge source for which ks_call has been
called with a particular value of "time" may call
post_get_data for any value of post_time <= time. */
char *ship_to; /* Buffer in which to put a block of
the data. */

uns16 *ship_size; /* On input, the size of the ship_to
buffer.
On output, the actual number of bytes
sent. */

uns16 *remain; /* Number of bytes remaining in the
data that was posted for the current time. */

BOOL *complete; /* This buffer includes the end of a
complete item */

BOOL *eof; /* YES if there is no more data (for this or
any greater value of post_time) */
Abstract

We have conducted speaker-independent isolated digit recognition
experiments using vector quantized cochleagrams. Without the use
of time order information, we were able to achieve a recognition
rate of 98.3%. With a modified Viterbi algorithm we achieved a
rate of 99.1%, in a test with a larger talker population. Since
these accuracies are not far apart, we must call into question the
effectiveness with which the Viterbi algorithm uses time order
information. These results demonstrate that the auditory spectrum
approach leads to high performance even with simple non-parametric
techniques and phoneme-level word models. The results presented
here update the results presented at ICASSP 87 [Loeb et al. ’87];
they verify the prediction that accuracy would be significantly
improved by doubling the training and testing talker populations,
and by using two repetitions of each digit from each talker in
training.


1  Introduction 

We have conducted a number of isolated digit recognition
experiments in an effort to evaluate the potential of an auditory
model front end. The experiments emphasize non-parametric
approaches and techniques that use little or no time order
information. We wished to set high performance standards for future
experiments while estimating the relative importance of the
various sources of information in the data.

Although  the  front  end  for  our  experiments  is  a  cochlear  model 
[Lyon  ’82],  there  is  nothing  explicitly  neural  about  our  tech¬ 
niques.  They  could  be  applied  to  any  other  vector  quantized 
representation.  Many  of  the  experiments  are  interesting  as  tech¬ 
niques  for  the  use  of  non-parametric  statistics  in  spite  of  the 
shortage  of  training  data.  Since  every  experiment  in  the  first 
group, originally presented at ICASSP 87 [Loeb et al. ’87], uses
the  same  training  and  testing  data  sets,  those  results  are  directly 
comparable;  later  experiments  extend  some  results  to  larger  train¬ 
ing  and  testing  sets. 


The bulk of this paper describes the various recognition methods
that we tested. We start with the non-time-order methods.
These include a “Basic Method”, four modifications to it, and a
section  on  vector  quantization  methods.  After  that  we  describe 
our  two  time-order  methods.  Just  before  we  conclude,  we  inter¬ 
pret  some  of  our  methods  and  results  in  terms  of  neural  networks. 

2  General  Methods 

The first repetitions of the isolated digits in the training
subset of the TI Connected Digit Database (sampling rate 20 kHz)
were analyzed by our cochlear model. The model produces a
discrete-time 92-channel spectrum, which is down-sampled to 1
kHz and quantized by a standard Euclidean quantizer with 1024
codewords. The quantizer codebook was trained on the first
repetitions of all the training speakers using the standard K-means
algorithm.
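
The quantization step can be sketched as follows. This is a toy illustration in Python, not the code used in the experiments: the function names, the random choice of initial centres, and the fixed iteration count are assumptions made for the example, and real frames would be 92-dimensional with a 1024-entry codebook.

```python
import random

def nearest(codebook, x):
    """Index of the codebook vector closest to x in Euclidean distance."""
    return min(range(len(codebook)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(codebook[k], x)))

def kmeans(frames, k, iterations=20, seed=0):
    """Plain K-means; initial centres are k distinct frames chosen at random
    (an assumed initialisation, the text does not describe one)."""
    rng = random.Random(seed)
    codebook = [list(f) for f in rng.sample(frames, k)]
    for _ in range(iterations):
        members = [[] for _ in range(k)]
        for x in frames:
            members[nearest(codebook, x)].append(x)
        for j, group in enumerate(members):
            if group:  # an empty cell keeps its previous centre
                codebook[j] = [sum(col) / len(group) for col in zip(*group)]
    return codebook

def quantize(codebook, frames):
    """Turn a sequence of spectral frames into a sequence of codewords."""
    return [nearest(codebook, x) for x in frames]
```

Each utterance then becomes the list of integers returned by `quantize`, which is the only representation the recognition methods see.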

In the first group of experiments to be described, half of the
112 training speakers were used for training and the other half
were used for testing. Thus the recognition results in these
experiments are ostensibly speaker-independent. We cannot claim
total  speaker-independence  because  we  used  both  sets  of  speak¬ 
ers  to  build  our  vector  quantizer.  Since  this  caused  two  of  the 
codewords  never  to  occur  in  our  training  set,  the  net  result  is 
actually  poorer  performance  than  we  find  in  extending  to  more 
speakers. 

In  later  experiments,  up  to  four  times  as  much  training  data 
was used, by using all 112 training speakers, with two repetitions
from  each.  One  repetition  of  each  digit  from  each  testing  speaker 
has  been  analyzed  so  far. 

3  Definitions 

codeword: an integer in [0, B], where B ≤ 1023. Each codeword
corresponds to some subset of $\Re^{92}$, and the set of codewords
corresponds to a partition of $\Re^{92}$.

utterance: the sequence of codewords derived from the cochleagram
of one of the speakers saying one of the vocabulary
words.

utterance histogram: a vector $H(A)$, where $H_{cw}(A)$ is the
number of occurrences of codeword cw in utterance A.

guess: the index to a vocabulary word. A guess is the result of
some recognition method operating on a test utterance.

guess vector: a <vocabulary-size>-dimensional vector of word
log probabilities. If a recognition method generates a
guess vector, then it will always output the index
of the most probable word as its guess.

4 The Basic Method

A matrix of conditional probabilities of observations
(codewords) given the word is first generated. Probabilities
are estimated from a count of the number of occurrences of each
codeword  within  all  training  utterances  of  each  vocabulary  word. 
Then,  for  recognition,  each  word  in  the  vocabulary  is  scored  by 
adding  the  log  likelihoods  for  all  time  samples  in  the  unknown 
utterance;  since  these  scores  do  not  depend  on  the  order  of  oc¬ 
currence  of  the  samples,  they  are  most  easily  computed  by  mul¬ 
tiplying  the  log  probability  matrix  by  the  utterance  histogram. 

Now each codeword, cw, indexes a vector V(cw) in our matrix,
whose ith component is given by

$$V_i(cw) = -\log \Pr[\,\text{codeword } cw \mid \text{word } i\,].$$

In this notation the same guess vector can be formed by
accumulating the V(cw)’s indexed by each codeword found in sequence
in the test utterance.
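
A minimal sketch of the basic method, written in Python for illustration only. The function names and the pseudo-count `floor` are assumptions: the text does not say how codewords with zero training counts were handled.

```python
import math
from collections import Counter

def train_cost_matrix(training, num_codewords, num_words, floor=0.5):
    """Return V with V[cw][i] = -log Pr[codeword cw | word i].

    `training` is a list of (word_index, codeword_sequence) pairs.
    `floor` is an assumed pseudo-count that avoids log(0).
    """
    counts = [[floor] * num_words for _ in range(num_codewords)]
    for word, seq in training:
        for cw in seq:
            counts[cw][word] += 1
    totals = [sum(counts[cw][w] for cw in range(num_codewords))
              for w in range(num_words)]
    return [[-math.log(counts[cw][w] / totals[w]) for w in range(num_words)]
            for cw in range(num_codewords)]

def guess_vector(V, utterance):
    """Matrix-by-histogram product: sum of V[cw] weighted by the count of cw."""
    hist = Counter(utterance)
    num_words = len(V[0])
    return [sum(n * V[cw][w] for cw, n in hist.items())
            for w in range(num_words)]

def guess(V, utterance):
    """The guess is the word with the lowest accumulated cost."""
    g = guess_vector(V, utterance)
    return min(range(len(g)), key=g.__getitem__)
```

Because only the histogram enters `guess_vector`, any reordering of the codewords in an utterance gives the same guess.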

This simple program gives 94.97% correct recognition on the
first repetitions (see Table 1), and 95.54% on both repetitions
(see Table 2). This implies that the codewords (and thus the
underlying cochleagrams) are doing a good job of acoustically
separating our vocabulary words.

It  is  interesting  to  note  that  an  earlier  version  of  the  codebook, 
in which the K-means algorithm had not iterated to convergence,
gave 94.16% recognition on the first reps. This indicates that the
method of codebook vector production is an important component
of a quantizer-based system.

Finally,  when  we  used  the  basic  method  on  quantized  LPC 
spectra  we  achieved  88.31%  recognition.  The  LPC  codebook  had 
1024 codewords made by the K-means algorithm, but the LPC
quantizer  produced  one  codeword  every  :  0  msec.  When  we  down- 
sampled  the  cochlear  quantizer  to  the  same  rate  the  basic  method 
gave 93.83% recognition. We repeated this test with eight different
training and testing populations. In all cases the number of
LPC errors was roughly double the number of cochlear errors.

5  Simple  Variations  on  the  Basic  Method 
5.1  Codeword  Grouping  -  OR  type 

Let us suppose there are several codewords covering the spectra
produced by /s/ sounds. Then the majority of the observations
of these codewords will occur in the vocabulary words
containing /s/ sounds. In the task at hand these words are six
and seven. So, if we assign one new number to every codeword,
cw, such that most of the observations of cw occur in the words
six and seven, then this new number should be a good indicator
of the /s/ sound.

To implement this idea we need a parameter, coverage-proportion.
We map each codeword, cw, to a list of the vocabulary words that
account for at least coverage-proportion of the observations of cw.
Next we map these lists to integers (i.e., we number them). The
composition of the two mappings is a many-to-one map from the
original codewords to some new codewords. We then use the new
codewords in the basic method.


There  are  two  important  variations  in  the  mapping  from  lists 
to  numbers.  We  make  an  unordered  grouping  by  numbering  the 
lists  as  sets  so  that  the  order  of  the  words  does  not  matter.  We 
make an ordered grouping by numbering the lists so that the
order  of  the  words  does  matter. 
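
The two mappings can be sketched in Python as below. This is an illustration under one assumed reading of the text: for each codeword, vocabulary words are taken in decreasing order of count until they cover at least coverage-proportion of its observations. The function name and data layout are hypothetical.

```python
def or_grouping(counts, coverage, ordered=False):
    """Map each original codeword to a new codeword number.

    counts[cw][w] is the number of training observations of codeword cw
    in vocabulary word w. Word lists are numbered as tuples when
    `ordered` is true (order matters), otherwise as sets.
    """
    labels = {}      # word list (or set) -> new codeword number
    mapping = []     # original codeword -> new codeword number
    for row in counts:
        total = sum(row)
        ranked = sorted(range(len(row)), key=lambda w: -row[w])
        chosen, covered = [], 0
        for w in ranked:
            if total == 0 or covered >= coverage * total:
                break
            chosen.append(w)
            covered += row[w]
        key = tuple(chosen) if ordered else frozenset(chosen)
        mapping.append(labels.setdefault(key, len(labels)))
    return mapping
```

With counts `[[5, 5, 0], [4, 6, 0]]` and coverage 0.8, both codewords map to the word set {0, 1}, so the unordered grouping merges them, while the ordered grouping keeps (0, 1) and (1, 0) apart. This is why an ordered grouping yields more new codewords than the unordered one at the same coverage-proportion.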

The results are on the “Basic” row of Table 1 and Table 2.
In Table 1 the groupings are assumed to be unordered (number
as  sets).  In  Table  2  we  compare  the  unordered  grouping  with 
coverage-proportion  =  0.80  to  the  ordered  grouping  (number  as 
lists)  with  the  same  coverage-proportion. 

Although this method is of marginal use in improving
performance, it does reduce the number of codewords. In Table 1
the 0.95 (unordered) grouping reduces the number of codewords
from the original 1024 to 405, and the 0.80 (unordered) grouping
has 313 codewords. In Table 2 the 0.80 unordered grouping
has 308 codewords. This is different from the same grouping in
Table 1 because we estimated the codeword distributions with
both repetitions. The 0.80 ordered grouping in Table 2 has 462
codewords.

Some  of  the  sets/lists  that  we  use  to  make  these  groupings 
have  very  clear  phonetic  content.  For  example  the  0.80  unordered 
grouping of Table 1 maps 71 of the original codewords to the
set {six, seven}, 39 originals to the set {three, zero}, and 22
to {one, nine}. Although many of the sets do not have such
obvious  phonetic  content,  most  of  the  sets  that  represent  large 
numbers  of  original  codewords  do.  Thus  we  may  have  found  a 
way  to  generate  phonetically  meaningful  labels  without  imposing 
our  pre-conceptions  upon  the  data.  In  the  future,  we  hope  to 
extend  this  method  to  the  grouping  of  sequences  of  codewords. 


5.2  Histogram  Compression 

Without changing the training procedure, we map the histogram
of each testing utterance to its log. I.e., if codeword X
occurs N times in testing utterance A, then the contribution of
codeword X to our guess vector for utterance A will be the product
of log(1 + N) and row number X of the matrix of log
probabilities.

Notice that we can achieve approximate log compression without
the use of histograms by letting the nth response to codeword
X be 1/n, since the sum 1 + 1/2 + ... + 1/N grows as log N.
This is highly reminiscent of habituation. Thus the
histogram compression method is both neurally plausible and
extendable to continuous speech recognition.
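
The relation between the two weightings can be checked numerically. The Python sketch below is illustrative only; the function names are assumptions, and the natural logarithm is assumed because the text does not give the base.

```python
import math

def compressed_weight(n):
    """Weight of a codeword that occurs n times: log(1 + N)."""
    return math.log(1 + n)

def habituating_weight(n):
    """Sum of the responses 1, 1/2, ..., 1/n (the n-th occurrence adds 1/n)."""
    return sum(1.0 / k for k in range(1, n + 1))

def compressed_guess_vector(V, histogram):
    """Guess vector with each histogram count N replaced by log(1 + N).

    V[cw][i] = -log Pr[codeword cw | word i]; histogram maps cw -> N.
    """
    num_words = len(V[0])
    return [sum(compressed_weight(n) * V[cw][i] for cw, n in histogram.items())
            for i in range(num_words)]
```

For large N the two weights differ by a nearly constant offset of about 0.58, so the habituating response tracks the logarithm up to an additive constant per codeword that has occurred.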

The  results  on  the  first  two  lines  of  the  result  tables  suggest 
that  more  codewords  are  better  for  this  method.  We  should  ex¬ 
pect  as  much,  since  compression  gives  more  equal  weight  to  each 
of  the  codewords  than  the  basic  method.  We  suspect  that  com¬ 
pression  offers  greater  improvements  than  any  of  the  codeword 
groupings  because  it  does  a  better  job  of  reducing  the  effects  of 
uninformative  codewords. 

When compression and time splitting (Section 7.1) are used
together we begin to rival our best Viterbi algorithm results.
This combined method is very simple and uses virtually no time
information at all. This suggests the possibility that our Viterbi
algorithm might be improved by the addition of a compressive
operation.




5.3  Non-Occurring  Codewords  (or  Necessity) 

The basic method will often guess “zero” when the input is
an “oh”. The reason is that the method is one of sufficiency: it
has no way of necessitating a /z/ sound before guessing “zero”.
Thus we need to modify the basic method to make use of the
codewords that do not occur.

For each codeword, cw, that does not occur in the test
utterance we add an inverse of V(cw) (Section 4) to our guess vector.
The inverse of V(cw) was computed by subtracting each element,
Vi(cw), from the maximum element of V(cw).
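
A small Python sketch of the necessity rule, for illustration only. The function name and the toy cost values in the usage note are assumptions.

```python
def necessity_guess_vector(V, utterance):
    """Basic-method costs plus an inverse term for absent codewords.

    V[cw][i] = -log Pr[codeword cw | word i]. For each codeword that does
    not occur, the inverse max(V[cw]) - V[cw][i] is added, so a word that
    strongly expects the missing codeword is penalised most.
    """
    num_words = len(V[0])
    g = [0.0] * num_words
    for cw in utterance:                      # codewords that occur
        for i in range(num_words):
            g[i] += V[cw][i]
    present = set(utterance)
    for cw, row in enumerate(V):              # codewords that do not occur
        if cw not in present:
            top = max(row)
            for i in range(num_words):
                g[i] += top - row[i]
    return g
```

As a toy case, let word 0 stand for “oh” and word 1 for “zero”, with a vowel-like codeword 0 and a /z/-like codeword 1, and take `V = [[0.69, 0.60], [4.6, 1.2]]`. For the utterance `[0, 0, 0]` the basic costs slightly favour “zero”, but the absent /z/ codeword adds 3.4 to the cost of “zero” and nothing to “oh”, so the guess becomes “oh”.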

If we examine the differences between the “Basic” and
“Necessity” results in Table 1, then it appears that this method becomes
more successful as the number of codewords decreases. If this
were simply the case, however, we would expect an even larger
improvement in a 0.50 unordered grouping, which had only 156
codewords. The recognition rates with this grouping were 90.26%
with the basic method and 91.56% with the necessity method. So
the utility of the necessity method seems to depend on the extent
to which our codewords correspond to phonetic units.

5.4  Several  Ranges  for  Each  Codeword 

The basic method has difficulty distinguishing between “nine”
and “one”. We would expect “nine” to have roughly twice as
many milliseconds of /n/ as “one”, but the basic method cannot
take advantage of this. We need the probability vectors V(cw)
(Section 4) to be functions of the number of occurrences of cw as
well as of cw.

To do this, we used the basic method with 6 matrices. If we
let N be the number of occurrences of codeword X in utterance
A, then utterance A will contribute to or use the
$\lfloor \log(1 + 2N) \rfloor$th matrix for codeword X.

Each of our non-time-order methods is a function that maps
every point in the space of utterance histograms to a log
distribution. The improvements shown in Table 1 may be due to
our adding more detail to this function. Examination of Table 2,
however, shows that the ranges method works about as well as
the compression method (Section 5.2), so this method may
work because it allows each codeword to contribute equally to the
final decision. Alternatively, the improvements may be due to the
fact that we are now taking into account the average number of
occurrences of a codeword among those utterances in which it
occurs. This statistic represents durational information.

5.5  Codeword  Groupings  -  AND  type 

The codewords are not independent. It is therefore interesting
to consider higher order conditionals such as the probability of
word W given codeword X, codeword Y, and no codeword Z.
The number of conditionals of this form is, however, prohibitively
huge. We thus have no choice but to concentrate on groups that
have a reasonable chance of occurring.

We  mapped  each  ranged  codeword,  RCW,  to  a  list  of  code¬ 
words  that  had  an  above-threshold  correlation  to  RCW  in  the 
training  data.  We  then  considered  each  of  these  lists  to  be  a 
codeword, where the number of occurrences of a list is the
geometric mean of the number of occurrences of each of its elements.
We  applied  the  basic  method  (Section  4)  to  the  unique  multi¬ 
element  lists  and  added  the  resulting  guess  vector  to  the  guess 
vector  obtained  from  the  ranges  method  (Section  5.4). 
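
The occurrence count of a list can be sketched in Python as below. Only the geometric-mean step is shown; how the lists are selected by correlation is not reproduced here, and the function name is an assumption.

```python
import math

def and_group_count(histogram, members):
    """Geometric mean of the members' counts in one utterance.

    The count is zero as soon as any member is absent, which is what
    makes the grouping behave like a logical AND.
    """
    counts = [histogram.get(cw, 0) for cw in members]
    if not counts or min(counts) == 0:
        return 0.0
    return math.exp(sum(math.log(c) for c in counts) / len(counts))
```

For example, a list of two codewords that occur 2 and 8 times in an utterance counts as occurring 4 times, while the same list counts as 0 if either codeword is missing.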


By  adding  the  guess  vectors  of  this  method  to  the  guess  vec¬ 
tors  from  the  (compressed,  original  codewords)  basic  method  we 
were  able  to  get  a  recognition  rate  of  98.3%.  Since  this  result 
was  totally  ad  hoc,  it  seems  reasonable  to  expect  that  a  carefully 
built  system  can  achieve  99%  recognition  with  no  time  order  in¬ 
formation  at  all. 

These AND-groupings took a great deal of time and memory
to implement. We consider these groups to be the equivalent
of word models. Since this method produced a significant
improvement we expect we will be able to find a simpler, more
effective means of using inter-codeword correlations to generate
useful word models.

6  Vector  Quantization  Methods 

Let  us  suppose  we  have  a  speech  recognizer  box.  Its  input 
is a speech waveform or sequence of observations, which may
be  thought  of  as  a  vector.  Its  output  is  a  word,  which  may  be 
thought  of  as  a  scalar.  Thus  our  speech  recognizer  is,  in  fact,  a 
vector  quantizer.  Can  it  be  implemented  directly  as  one? 

To  cut  down  the  pattern  space  some,  we  use  binary  histograms 
(i.e.,  each  codeword  either  occurs  or  does  not  occur  in  the  test 
utterance)  with  Euclidean  distance,  and  the  standard  K-means 
algorithm  to  construct  a  codebook.  A  test  utterance  then  maps 
to  its  closest  codebook  vector,  which  in  turn  tells  us  which  vo¬ 
cabulary  word  to  guess  (the  one  that  most  frequently  mapped  to 
that codebook vector in training).

In a second experiment codebook vector k was set to be the
centroid of all the training vectors for vocabulary word k. In
another experiment we formed 32 ortho-normal vectors from the
32 codebook vectors of the first experiment. We then used the
basic method by finding the projection of the test utterance on
each of these vectors and summing the product of these numbers
and the appropriate log probability vectors.
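
The second experiment (one centroid per vocabulary word) can be sketched in Python as follows. This is an illustration with hypothetical function names; it assumes every vocabulary word has at least one training utterance.

```python
def binary_histogram(utterance, num_codewords):
    """1 where a codeword occurs in the utterance, 0 elsewhere."""
    present = set(utterance)
    return [1.0 if cw in present else 0.0 for cw in range(num_codewords)]

def word_centroids(training, num_codewords, num_words):
    """Centroid of the binary histograms of each vocabulary word."""
    sums = [[0.0] * num_codewords for _ in range(num_words)]
    n = [0] * num_words
    for word, seq in training:
        for cw, bit in enumerate(binary_histogram(seq, num_codewords)):
            sums[word][cw] += bit
        n[word] += 1
    return [[s / n[w] for s in sums[w]] for w in range(num_words)]

def nearest_word(centroids, utterance):
    """Guess the word whose centroid is closest in Euclidean distance."""
    h = binary_histogram(utterance, len(centroids[0]))

    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(h, c))

    return min(range(len(centroids)), key=lambda w: dist2(centroids[w]))
```

Here the recognizer is literally a vector quantizer: the "codebook" has one entry per vocabulary word and the output codeword is the guess.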

In the first experiment, a codebook of size 16 gave recognition
= 66.6%. When size = 32, recognition = 76.5%, and when size
= 128, recognition = 80.5%.

In the second experiment, with one codeword per vocabulary
word, we got 92.69% recognition. This gives a rough idea of the
efficacy of the K-means algorithm in approximating the “correct”
decision boundaries.

The  third  experiment  gave  recognition  =  70.13%.  Since  the 
codebook  vectors  we  ortho-normalized  for  this  experiment  were 
the  same  32  vectors  used  in  the  32  vector  part  of  the  first  exper¬ 
iment,  it  is  clear  that  this  method  was  of  no  help  whatsoever. 

7  Time  Order  Methods 
7.1  Time  Splitting 

It  is  surprising  that  we  can  do  so  well  without  any  time  in¬ 
formation,  but  we  will  need  to  use  it  eventually.  As  a  simple 
extension  of  the  methods  we  have  tried  so  far,  we  use  the  basic 
method  on  the  first  and  second  time-halves  of  the  utterances. 
Thus each test utterance will produce two guess vectors: one
for its beginning and one for its end. The final guess vector will
be the sum of these. In a second experiment, we use the
necessity look-up method (Section 5.3) on both time-halves. In a
third experiment, we use the range method (Section 5.4) on both
time-halves. Finally, we use the compression method (Section
5.2) on both time-halves.
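
Time splitting with the basic method can be sketched in Python as below. The sketch assumes two cost matrices trained separately on the first and second halves of the training utterances; the function names and the midpoint rule for odd lengths are assumptions.

```python
def half_costs(V, samples):
    """Basic-method cost of a run of codewords under cost matrix V."""
    num_words = len(V[0])
    return [sum(V[cw][i] for cw in samples) for i in range(num_words)]

def time_split_guess_vector(V_first, V_second, utterance):
    """Sum of the guess vectors of the two time-halves.

    V_first[cw][i] and V_second[cw][i] are -log probabilities of
    codeword cw given word i in the first and second half respectively.
    """
    mid = len(utterance) // 2
    first = half_costs(V_first, utterance[:mid])
    second = half_costs(V_second, utterance[mid:])
    return [a + b for a, b in zip(first, second)]
```

Two utterances with the same histogram, such as `[0, 0, 1, 1]` and `[1, 1, 0, 0]`, now receive different guess vectors, which is the only sense in which this method uses time order.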


As  in  Section  5.2,  the  fact  that  such  a  simple  method  could 
provide so much of an improvement (see Table 1) confirms that
the  time  order  information  will  be  extremely  helpful  when  used 
properly.  The  results  of  the  combined  time  splitting  and  code¬ 
word  ranges  methods  are  respectable,  but  they  depend  far  too 
heavily  on  the  grouping  parameter  to  be  considered  useful. 

7.2  Viterbi  Algorithm 

The  Viterbi  algorithm  is  well  known  in  speech  recognition. 
We  have  applied  it  using  simple  finite-state  word  models  similar 
to those used by Bush and Kopec [Bush ’85].

The cost metric used by the Viterbi algorithm in finding a
best model-based segmentation is -log Pr[codeword | state], as
in our basic method. The state tables were initially trained using
segmentations found by Bush and Kopec’s LPC-based recognizer;
they have been retrained and modified to improve performance.
In comparing the fits of the various word models, we used
measures other than total cost (probability), as described elsewhere
[Lyon ’87]. In particular, the average costs in each state were
given equal weight, rather than giving equal weight per unit of
time; and a term was added to account for the probability of the
duration in each state, after the best fit to each model was found.
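
A generic Viterbi segmentation with this cost metric can be sketched in Python as below. It is not the recognizer described here: the strictly left-to-right topology, the absence of transition costs, and the reading of "equal weight per state" as a sum of per-state mean costs are assumptions, and the duration term is omitted. The utterance must have at least as many samples as the model has states.

```python
def viterbi_segment(state_costs, utterance):
    """Best left-to-right segmentation of an utterance.

    state_costs[s][cw] = -log Pr[codeword cw | state s]. The path starts
    in state 0, ends in the last state, and at each sample either stays
    or advances one state. Returns (total cost, state of each sample).
    """
    S, T = len(state_costs), len(utterance)
    INF = float("inf")
    cost = [[INF] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    cost[0][0] = state_costs[0][utterance[0]]
    for t in range(1, T):
        for s in range(S):
            stay = cost[t - 1][s]
            advance = cost[t - 1][s - 1] if s > 0 else INF
            best = min(stay, advance)
            if best < INF:
                cost[t][s] = best + state_costs[s][utterance[t]]
                back[t][s] = s if stay <= advance else s - 1
    path = [S - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return cost[T - 1][S - 1], path

def per_state_average(state_costs, utterance, path):
    """Sum over states of the mean cost within that state (assumed reading)."""
    totals, counts = {}, {}
    for s, cw in zip(path, utterance):
        totals[s] = totals.get(s, 0.0) + state_costs[s][cw]
        counts[s] = counts.get(s, 0) + 1
    return sum(totals[s] / counts[s] for s in totals)
```

With per-state averaging, a state that lasts many samples no longer dominates the score, which is the contrast with total Viterbi cost drawn in the text.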

The  scores  reported  in  Table  1  are  the  best  of  several  vari¬ 
ations.  Other  variations  on  the  scoring  function,  for  example 
using total Viterbi cost or omitting durational probabilities,
resulted in up to twice as many errors. We were able to reduce
the error rate from 1.62% to 0.91% using the same codebook but
twice as many training repetitions (still testing only on first
repetitions). Testing on both repetitions from the other 56 speakers
increases the error rate to 1.46%; this doubling of training and
testing data leads to error rates as low as 1.70% in the best of
the non-time-order tests.

It  is  interesting  that  the  finite-state  models  give  a  performance 
that  is  at  best  only  slightly  better  than  the  techniques  that  use 
little  or  no  time  sequence  information.  Better  techniques  for 
handling  timing  and  dynamics  are  clearly  still  needed. 

In a later test, training talkers were grouped into two clusters
based  on  codeword  occurrence  histograms  accumulated  across 
all  of  their  utterances,  and  separate  state  models  were  trained 
on  each  cluster.  It  was  found  that  the  two  clusters  (found  by 
K-means  algorithm)  partitioned  the  talkers  almost  perfectly  into 
males  and  females.  Recognition  using  both  sets  of  models  pro¬ 
vided  no  significant  difference  from  using  a  single  set  of  models, 
contrary to our positive experience with separate male and
female models in an LPC-based digit recognizer. This is probably
due to the fact that the vocalic auditory spectra resolve
harmonics enough to distinguish high and low pitches, so that the
simple non-parametric techniques already separate male from
female fairly well.

Finally,  training  on  two  repetitions  from  all  112  training  talk¬ 
ers  and  testing  on  113  new  talkers  (first  reps  only),  the  error  rate 
is  reduced  to  only  0.89%  (11  errors  in  1243).  The  second  reps 
are  expected  to  lead  to  up  to  twice  as  many  errors,  based  on  our 
experience  with  the  112  training  talkers,  so  the  net  system  per¬ 
formance  is  estimated  at  not  worse  than  33  errors  in  2486  (which 
should  be  compared  with  TI’s  best  published  result  of  only  14 
errors  in  2486  tokens  [Bochieri  et  al.  ’86]). 
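
The arithmetic behind these counts can be laid out as a short Python check. The vocabulary size of 11 is inferred from 1243 tokens for 113 talkers and is not stated in this passage; the variable names are illustrative.

```python
VOCABULARY_SIZE = 11                      # inferred: 1243 tokens / 113 talkers
talkers = 113

first_rep_tokens = talkers * VOCABULARY_SIZE          # first repetitions only
first_rep_errors = 11
second_rep_errors_bound = 2 * first_rep_errors        # "up to twice as many"

total_tokens = 2 * first_rep_tokens                   # both repetitions
total_errors_bound = first_rep_errors + second_rep_errors_bound

def percent(errors, tokens):
    """Error rate as a percentage of tokens."""
    return 100.0 * errors / tokens

first_rep_rate = percent(first_rep_errors, first_rep_tokens)   # about 0.885
bound_rate = percent(total_errors_bound, total_tokens)         # about 1.33
ti_rate = percent(14, total_tokens)                            # about 0.56
```

On these figures the estimated worst case of 33 errors in 2486 tokens is a little over twice the 14 errors quoted for the TI result.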
Time Splitting
... with Necessity

... with Ranges

Viterbi  Algorithm 


OR-Grouping
Original
94.97% 
96.75% 
94.16% 
94.81% 
96.10%
95.78%
98.38%
0.95
95.62%

97.24%
95.29%

94.64% 

96.10% 

95.94% 

96.27% 

97.56% 

97.56% 


Table  1:  Recognition  percentage  for  616  test  utterances.  Group¬ 
ing  numbers  are  the  coverage-proportion  values  from  Section  5.1 
Method                   Original

Basic Method             95.54%
... with Compression     97.40%
... with Necessity       94.40%
... with Ranges          97.56%
... with AND-Grouping    -
Time Splitting           96.92%
... with Compression     98.38%
... with Ranges          97.65%
Viterbi Algorithm        98.54%


OR-Grouping
0.80 Unordered
96.19% 
96.02% 
96.83% 


96.92% 

57.32% 


0.80  Ordered 
96.02% 
96.59% 
96.43% 
97.48% 
98.05% 
96.59% 
97.73% 
98.05% 


Table  2:  Recognition  percentages  for  1232  test  utterances.  Half 
of these utterances were used for Table 1. The other half are the
second  repetitions. 

8  Neural  Equivalents 


The  four  variations  to  the  basic  method  were  produced  by  a 
simple neural modelling paradigm. Given a numerical system like
the  Basic  Method  we  describe  the  system  as  a  neural  network, 
find  some  way  to  make  this  network  more  realistic,  and  then  test 
the  numerical  system  version  of  the  more  realistic  neural  network. 

We  consider  the  Basic  Method  (Section  4)  to  be  equivalent  to 
a  neural  network  in  which  each  codeword  corresponds  to  an  input 
neuron, and each vocabulary word corresponds to an output
neuron. Under this equivalence, input neurons fire once each time
their codewords occur in the test utterance. Input neuron cw is
connected to output neuron i by a linear excitatory synapse of
weight  Vi(cw).  Output  neurons  sum  their  inputs,  and  the  number 
of  the  cell  with  the  lowest  value  (most  probable)  is  our  guess. 

The OR-groupings (Section 5.1) were inspired by a kind of
connectionist model in which each input neuron excites the set of
output neurons with which it is associated. Each output neuron
then excites its input neurons. We believed that such a system
would ultimately make indistinguishable the input neurons that
belong to the same set of output neurons.

Two of the other variations were motivated by the fact that
neurons are not simple linear devices. For the Compression
method (Section 5.2) we considered that the response curves of
real neurons look like bounded logarithms. We thought of the
Ranges method (Section 5.4) when we considered that real neural
pools tend to divide the numbers they encode into approximate
log ranges [Brooks ’86, Chapters 3 and 4]. Thus for each
codeword we would expect one neuron to fire when the codeword does
not occur, one to be sensitive to a small number of occurrences,
another to fire in proportion to a larger range of occurrences, and
another that only fires during large inputs.
9 Conclusions

It is clear that our cochlear model provides an adequate, if not
superior, spectral representation. The unexpectedly good
performance of the simple methods implies that the cochleagrams are
effectively separating phonetic units. Furthermore, the
cochleagrams appear to be better at separating phonetic units than LPC
spectra.

The results of our other non-time-order experiments are
intriguing. We now know that there exists a vector quantizer that
maps utterance histograms to words with at least 98.3% accuracy
(Section 5.5). Although our direct quantizer experiments are
incomplete, it appears as though the K-means algorithm cannot
reach this level of accuracy. If that is the case, then we can
perhaps achieve a substantial improvement by using a better
quantizer on the cochleagrams. By analogy with our recognition
methods, we expect that we can improve the K-means
algorithm by assigning different weights to each dimension and
by feeding back the distribution of each vocabulary word with
respect to each codeword.

It is important to note that we have many results ranging
from 9% to 90%. The methods reported here are methods that
worked, and the methods that worked have generally been
neurally motivated. At worst we have a way of thinking about the
problem that leads to some solutions. At best, the brain uses the
only possible solution, and we are using a non-random means of
finding that solution.

The most difficult remaining question is how to handle time
order information. The proximity of our Viterbi results to the
results that used little or no time order information forces us
to conclude either that time order information is not so useful
as had been thought, or that the Viterbi algorithm with simple
word models does not use it very effectively. If we suppose that
the AND-grouping method (Section 5.5) uses most of the time
information (namely, the codeword durations and the codeword
co-occurrences), then it seems that the actual order in which
codewords occur is not terribly important. On the other hand, this is
the only time information available to the time splitting method
(Section 7.1), which was able to out-perform all methods except
the Viterbi algorithm. Thus we cannot call any particular piece
of information crucial. In this situation our best option is to go
back and ensure that each component of our system is
exceptionally well done. For this reason we believe that the next set of
experiments should involve different quantizations of the original
cochleagrams.
ABSTRACT

Speech recognition algorithms employing a similarity measure
between the input speech utterance and the stored reference
patterns to determine recognition of a word/sentence are
computationally intensive. The instantaneous vocabulary size that can
be handled in real-time is relatively small. This limitation can
be alleviated either by using multiple programmable processors
or by using special purpose hardware to handle the
computation-intensive tasks. In a research environment the former approach
is preferred, because improvements to the algorithm can rapidly
be incorporated and their effects studied in real-time. Texas
Instruments has developed a multiple-processor architecture based
on the TMS32020 DSP, called Odyssey, that interfaces with
Explorer, a symbolic computer. This paper addresses the issues
involved in partitioning and allocating tasks in a multiple-processor
environment to maximize throughput, and discusses the
implementation of a grammar-driven speaker-dependent
connected-word recognizer (GDCWR) as an example application that uses
the power of multiple processors.

Introduction 


Many speech recognition algorithms extract a feature vector
from the input signal at a rate of 25 to 50 times per second
[1]. Vocabulary words may then be represented by sequences
of feature vectors, each representing the spectral content of the
signal over a short period of time called a frame. In the
recognition process, new feature vectors are computed at the frame
rate (25 to 50 Hz) and compared to every reference vector in
every vocabulary word. Comparison involves Euclidean distances
between N-element vectors, where N is typically between 10 and
20, and dynamic programming to optimally time-align reference
vectors with the input speech vectors. This process is
computationally demanding and limits the size of the active
vocabulary that can be processed in real-time. One way to overcome
this limitation is to use multiple programmable processors to
distribute this loading. Texas Instruments has developed a
multiple processor architecture called Odyssey. In a research
environment, the multi-processor programmability is extremely
desirable since such an architecture can be used as a prototype to
test and evaluate advanced robust speech recognition/DSP
algorithms.

The Odyssey system [2] is an expandable, multiple digital signal
processor (DSP) architecture based on the TMS32020 programmable
microcomputer [3]. Key features of the board are:
20 million multiply/accumulates per second, 512K bytes of data
space, and expandability to 16 boards on a NuBus host.

The Odyssey host is Texas Instruments' Explorer [4], a LISP
machine workstation. Software has been provided which extends
the high productivity environment of the Explorer into the area
of digital signal processing. This provides an environment to
perform many intelligent signal processing tasks by associating
meaningful relationships between quantitative (signal processing)
and qualitative (symbolic processing) entities to develop
inferences using expert system technology. Applications such
as grammar-driven connected-speech recognition, neural network
simulation, and generation of speech with natural language
generation techniques are some of the tasks that can utilize the
computational power of the multiple DSP and symbolic processing.
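
The frame-level comparison described in this Introduction (a
Euclidean distance between N-element feature vectors, combined
with dynamic programming for time alignment) can be illustrated
with a minimal sketch. It is not the Texas Instruments
implementation: the function names, the symmetric local path
constraint, and the absence of slope limits or pruning are all
assumptions made for illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def dtw_distance(reference, utterance):
    """Accumulated distance of the best time alignment of two sequences.

    Assumed local constraint: each step advances the reference frame,
    the input frame, or both (no slope limits, no pruning).
    """
    inf = float("inf")
    rows, cols = len(reference), len(utterance)
    cost = [[inf] * (cols + 1) for _ in range(rows + 1)]
    cost[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            d = euclidean(reference[i - 1], utterance[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # reference advances
                                 cost[i][j - 1],      # input advances
                                 cost[i - 1][j - 1])  # both advance
    return cost[rows][cols]


def recognize(templates, utterance):
    """Return the vocabulary word whose template aligns at lowest cost."""
    return min(templates, key=lambda w: dtw_distance(templates[w], utterance))
```

Every input frame is compared with every reference frame of every
template, so the work grows in proportion to the vocabulary size.
This is the load that the multiple processors are meant to share.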


Grammar-Driven Connected-Word Recognizer
(GDCWR)

Figure 1 shows a block diagram of the GDCWR system [5].
An isolated word recognizer outputs all the words that are
hypothesized, along with their corresponding distance scores and
estimated durations. The basic technology of the word recognizer
is a modification of the original Texas Instruments LPC-based
isolated word recognition system [5]. The sentence hypothesizer
constructs probable sentences from the word hypotheses and their
time marks, and invokes grammatical constraints to consider only
the admissible paths, so as to output the recognized sentence with
the lowest distance score. The list of possible subsentences is
pruned to minimize both memory and processing requirements. The
distance measure for the sentence has three parts: the first
component is the sum of individual word distance scores multiplied
by corresponding word durations; the second is a penalty for
overlap or underlap of adjacent words; and the third is a silence
(null speech) distance measure. An important feature of the
recognizer is that the sentence hypothesizer does not control the
isolated word recognizer by any feedback. This ensures that all
information is preserved for late binding and possible recovery
from higher level errors.

Figure 1: Block diagram showing the components of the Grammar-Driven Connected Word Recognizer.

Figure 2: Task Partitioning

The loading of the similarity measurements is a fixed,
predictable function of the vocabulary size, whereas the loading
of the sentence hypothesizer is not fixed and is a function of the
size and complexity of the grammar and the input utterance.
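
The three-part sentence distance described in this section can be
written out as a short sketch. The text gives only the three
components, so the details here are assumptions: the
`WordHypothesis` fields, a penalty that is linear in the number of
overlapping or missing frames, the `gap_penalty` weight, and a
silence term supplied as a precomputed number.

```python
from dataclasses import dataclass


@dataclass
class WordHypothesis:
    word: str
    start: int       # first frame of the hypothesized word
    end: int         # one past the last frame
    distance: float  # distance score reported by the word recognizer


def sentence_distance(hyps, silence_distance=0.0, gap_penalty=1.0):
    """Score a candidate sentence; lower is better."""
    # 1. Word distance scores weighted by word durations (in frames).
    word_term = sum(h.distance * (h.end - h.start) for h in hyps)
    # 2. Penalty for overlap (negative gap) or underlap (positive gap)
    #    between adjacent words. Linear form is an assumption.
    junction_term = sum(gap_penalty * abs(nxt.start - prev.end)
                        for prev, nxt in zip(hyps, hyps[1:]))
    # 3. Silence (null speech) distance, assumed computed elsewhere.
    return word_term + junction_term + silence_distance
```

Because the score depends only on the word hypotheses and their
time marks, the sentence hypothesizer can evaluate it without
sending anything back to the word recognizer.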

Task  Partitioning  and  Allocation 

In an ideal multiple processor environment one would expect
the throughput of the system to increase linearly as the number
of processors increases. However, this is not always true. In
practice, the throughput in a multiple processor system increases
significantly only for the first few additional processors and in
fact begins to decrease after a certain number of processors [6].
This is due to increased interprocessor communication (IPC),
which occurs when software modules resident on different
processors need to communicate with each other. Communication
protocols, management of storage, waiting time in queues, etc. all
contribute to the overhead. This overhead grows rapidly with large
numbers of highly interacting processors, and the system
throughput actually begins to decrease. This is referred to as the
saturation effect.
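
The saturation effect can be reproduced with a toy model. The
model is an assumption chosen purely for illustration and is not
taken from this paper or from [6]: each processor contributes one
unit of work, and the overhead grows with the number of ordered
processor pairs, n(n - 1), scaled by a hypothetical cost
`ipc_cost`.

```python
def throughput(n, ipc_cost=0.02):
    """Toy throughput of n processors with pairwise IPC overhead.

    Assumed model: useful work n, slowed by a factor
    1 + ipc_cost * n * (n - 1).
    """
    return n / (1.0 + ipc_cost * n * (n - 1))


def best_processor_count(max_n, ipc_cost=0.02):
    """Processor count in 1..max_n that maximizes the toy throughput."""
    return max(range(1, max_n + 1), key=lambda n: throughput(n, ipc_cost))
```

With `ipc_cost=0.02` the modeled throughput rises quickly for the
first few processors, flattens, and falls once more than seven
processors are used. Lowering `ipc_cost`, that is, reducing the
interaction between processors, moves the peak to a larger number
of processors.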

The designer is now faced with a dilemma. In order to exploit
the computing resources offered by the multiple processor system,
the load needs to be balanced. But balancing the load creates
interprocessor overhead, which needs to be kept as low as
possible. One way to reconcile these two conflicting factors is to
allocate closely related software modules to the same processor
and keep the communication between processors to a bare minimum.
This demands a thorough understanding of the algorithm and the
flow of data involved.

The first step is to partition the algorithm into several
individual sub-tasks or modules. For example, GDCWR has been split
up into several subroutines which have been arbitrarily named
A, B, C, etc. Subroutine A calculates the 11 autocorrelation
values from a frame of digitized speech samples, B is a routine
that computes the reflection coefficients, and so on. These
subroutines, represented by circles, are shown in Figure 2 and are
connected in accordance with the flow of data. The number of
words being passed from one subroutine to another on a per
frame basis represents the inter-module communication and has
been placed on the connecting arcs. This process is known as
task partitioning.
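
As an illustration of the kind of work done by the first two
subroutines, the sketch below computes 11 autocorrelation values
(lags 0 to 10) from a frame and derives reflection coefficients
from them. The text does not say how routine B is implemented; the
Levinson-Durbin recursion is used here as a standard choice, and
the sign convention for the coefficients is an assumption.

```python
def autocorrelation(frame, num_lags=11):
    """Autocorrelation values r[0..num_lags-1] of one speech frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(num_lags)]


def reflection_coefficients(r):
    """Levinson-Durbin recursion on autocorrelation values r[0..p].

    Returns p reflection coefficients, using the convention
    A(z) = 1 + a1 z^-1 + ... + ap z^-p.
    """
    order = len(r) - 1
    a = [1.0] + [0.0] * order   # predictor polynomial coefficients
    err = float(r[0])           # prediction error energy
    ks = []
    for m in range(1, order + 1):
        if err <= 0.0:          # silent or degenerate frame
            ks.extend([0.0] * (order - m + 1))
            break
        acc = r[m] + sum(a[j] * r[m - j] for j in range(1, m))
        k = -acc / err
        ks.append(k)
        new_a = a[:]
        for j in range(1, m):
            new_a[j] = a[j] + k * a[m - j]
        new_a[m] = k
        a = new_a
        err *= 1.0 - k * k
    return ks
```

In this sketch the data passed from A to B is the list of 11
autocorrelation values per frame. Counts of this kind are what the
arcs of the partitioning graph record.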


Once the task partitioning is completed, the next step is to
allocate these modules to different processors so that the system
throughput is maximized. This is known as task allocation. It
is during this phase of the design that one has to balance the
two conflicting factors of load distribution and minimum
interprocessor communication. To maximize throughput, the
individual processors should be able to run autonomously to the
extent possible.

The first step in task allocation is to identify those routines
that are closely related and/or communicate with one another
extensively. In Figure 2 routines A, B, C, E, F and I are closely
related and are therefore fused together to form a bigger module
called the Preprocessor, which is allocated to one processor. It
was found that H contributes more than 50% of the loading
and limits the vocabulary size. An entire processor must
therefore be devoted to doing H. However, there is considerable
traffic between H and P, and interprocessor communication would be
increased if these routines were resident on different processors.
H and P are therefore fused together to form a bigger module
called the Word Hypothesizer and allocated to another processor.
U is a routine that could be allocated to the Word Hypothesizer
or the Preprocessor, but since we wish to allocate as much CPU
time to the Word Hypothesizer as possible to do the similarity
measurements, U is allocated to the Preprocessor. The remaining
routines G and S comprise the Sentence Hypothesizer and are
allocated to another processor. This completes the task allocation
of the CWR software. The basic recognition system therefore
requires three processors, viz., the Preprocessor, the Word
Hypothesizer and the Sentence Hypothesizer. The parallelism
offered by a multiprocessor architecture can now be utilized to
increase the active vocabulary size by the concurrent execution
of the Word Hypothesizer on two or more processors, with each
processor addressing a smaller subset of the vocabulary.
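
The trade-off made during allocation can be made concrete by
counting the words per frame that cross processor boundaries under
a given allocation. The grouping of routines below follows the
text. The arcs and their weights are hypothetical, since the real
values appear only on Figure 2, which is not reproduced here.

```python
# Hypothetical arcs: (producer, consumer) -> words passed per frame.
# The connectivity and the numbers are placeholders, not Figure 2.
ARCS = {
    ("A", "B"): 11, ("B", "C"): 10, ("C", "E"): 10, ("E", "F"): 10,
    ("F", "I"): 10, ("I", "U"): 10, ("U", "H"): 10, ("H", "P"): 50,
    ("P", "G"): 4, ("G", "S"): 4,
}


def ipc_words_per_frame(arcs, allocation):
    """Words per frame that must travel between different processors."""
    return sum(words for (src, dst), words in arcs.items()
               if allocation[src] != allocation[dst])


# Allocation described in the text: three fused modules.
PAPER_ALLOCATION = {m: "preprocessor" for m in "ABCEFIU"}
PAPER_ALLOCATION.update({m: "word_hypothesizer" for m in "HP"})
PAPER_ALLOCATION.update({m: "sentence_hypothesizer" for m in "GS"})

# Alternative that separates the heavily communicating pair H and P.
SPLIT_ALLOCATION = dict(PAPER_ALLOCATION, P="sentence_hypothesizer")
```

With these placeholder weights, keeping H and P on one processor
leaves only the lightly loaded arcs crossing processor boundaries,
while moving P elsewhere puts the heaviest arc on the
interprocessor path. This is the reasoning behind fusing H and P.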

Figure 3 shows the allocation of tasks to different processors
on one Odyssey board. Processor 0 is the Preprocessor, Processors
1 and 2 are the Word Hypothesizers and Processor 3 is the
Sentence Hypothesizer. Note that all word hypothesizers operate
on the same data from the preprocessor, and communicate with
a single sentence hypothesizer. A single Odyssey board is capable
of recognizing about 100 words. Each additional board is capable
of addressing about 200 more words.
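
The capacity figures follow from simple arithmetic, sketched
below. The Conclusions give about 50 words per word hypothesizer,
and the first board holds four processors, two of which are word
hypothesizers. That every processor on an additional board runs a
word hypothesizer is an inference from the 200-word figure, and
the round-robin split of the vocabulary is a hypothetical choice.

```python
def vocabulary_capacity(boards, words_per_hypothesizer=50,
                        processors_per_board=4, shared_processors=2):
    """Approximate active vocabulary for a given number of boards.

    shared_processors counts the single Preprocessor and the single
    Sentence Hypothesizer, which are not replicated.
    """
    if boards < 1:
        return 0
    hypothesizers = boards * processors_per_board - shared_processors
    return hypothesizers * words_per_hypothesizer


def split_vocabulary(words, hypothesizers):
    """Deal the vocabulary out so each hypothesizer gets a subset."""
    return [words[i::hypothesizers] for i in range(hypothesizers)]
```

For example, `split_vocabulary` applied to a ten-digit vocabulary
and two hypothesizers gives each processor five reference words to
compare against every incoming frame.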

In designing real-time systems one tends to optimize the
entire software. Optimization of all real-time software, though
desirable, may not be practical: the resulting increase in
processing efficiency does not justify the effort required to
optimize all the code. It is often found that there are only a few
sections of code where a large percentage of the total processing
time is spent. Hence, efforts should be directed towards
optimizing only these small sections of the code. For example, in
the Word Hypothesizer module, it was found that 90% of the time
was spent in the distance measuring routine that compared the
input speech with the stored references. Consequently only this
routine was optimized.
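
The benefit of optimizing only the dominant routine can be
estimated with Amdahl's law. The 90% figure comes from the text.
The speedup factors in the usage note are hypothetical and are not
measurements from the GDCWR implementation.

```python
def overall_speedup(hot_fraction, hot_speedup):
    """Amdahl's law: whole-program speedup when only the fraction
    hot_fraction of the run time is made hot_speedup times faster."""
    return 1.0 / ((1.0 - hot_fraction) + hot_fraction / hot_speedup)
```

If the distance routine (90% of the time) is made twice as fast,
the module as a whole runs about 1.8 times faster. Doubling the
speed of all the remaining code (10% of the time) would gain only
about 5%.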

Multi-processor software should be designed so that it can be
easily debugged. As with most computer systems, a multiprocessor
system is designed and specified top-down and debugged bottom-up.
Thus it is important that one be able to debug the module
associated with each processor individually. In the GDCWR
implementation, each processor module is designed to communicate
via I/O buffers. During the debug process, the input buffer is
filled with canned data and the processor is made to execute its
function. The output buffer can then be examined for correctness.
Using this technique each processor module can be tested prior to
integration of the entire application.
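
The buffer-based debugging technique can be sketched as follows.
The class and function names are hypothetical, and `frame_energy`
is only a stand-in for the real function of a processor module.

```python
from collections import deque


class ProcessorModule:
    """A module that communicates only through its I/O buffers."""

    def __init__(self, function):
        self.function = function
        self.input_buffer = deque()
        self.output_buffer = deque()

    def run(self):
        """Consume every item in the input buffer, in order."""
        while self.input_buffer:
            item = self.input_buffer.popleft()
            self.output_buffer.append(self.function(item))


def frame_energy(frame):
    """Stand-in processing function: energy of one frame."""
    return sum(sample * sample for sample in frame)


def check_module_with_canned_data():
    """Fill the input buffer with canned data, run, inspect the output."""
    module = ProcessorModule(frame_energy)
    module.input_buffer.extend([[1, 2, 3], [0, 0, 0]])
    module.run()
    return list(module.output_buffer)
```

Because the module touches nothing except its two buffers, the
same code can be exercised in isolation with canned data and later
connected to the buffers of neighboring processors unchanged.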

Performance  Testing 

The performance of a connected word recognizer is extremely
difficult to quantify because of the lack of accepted data base
and measurement standards. However, Texas Instruments has
done a limited amount of testing on this algorithm using an
internally developed connected digit data base. The data used to
test the algorithm consisted of 20 speakers reading 5-digit
strings. A total of 2000 strings were tested. Two application
scenarios are of interest: those applications where the length of
the digit sequence is unknown and those (like telephone numbers,
for example) where the length of the sequence is known. The
results of the test are summarized below:

UNKNOWN LENGTH :  5.2% sentence error rate
                  1.1% word error rate

KNOWN LENGTH   :  3.4% sentence error rate
                  0.7% word error rate

Note that the word error rate for digit strings of known length
approaches that achieved for the best isolated word systems.
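
The word and sentence error rates can be related by a simple
check. Under the assumption, not made in the text, that word
errors occur independently, a 5-digit string is wrong whenever at
least one of its digits is wrong.

```python
def expected_sentence_error(word_error_rate, words_per_sentence=5):
    """Sentence error rate implied by independent word errors."""
    return 1.0 - (1.0 - word_error_rate) ** words_per_sentence
```

Word error rates of 1.1% and 0.7% give expected sentence error
rates of roughly 5.4% and 3.5%, close to the measured 5.2% and
3.4%. The small differences may reflect errors that cluster within
the same string, or the way insertions and deletions are counted.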

Conclusions 

We have presented the issues involved in partitioning and
allocating tasks in a multiprocessor environment and have
discussed in detail the implementation of a connected word
recognizer on the Odyssey/Explorer system. Each word hypothesizer
is capable of addressing about 50 words, providing a 100 word
capability for the first Odyssey board and 200 words for each
additional board.